Wasf-Vec: Topology-based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)(2020)

引用 0|浏览23
暂无评分
摘要
Word clustering is a serious challenge in low-resource languages. Since words that share semantics are expected to be clustered together, it is common to use a feature vector representation generated from a distributional theory-based word embedding method. The goal of this work is to utilize Modern Standard Arabic (MSA) for better clustering performance of the low-resource Iraqi vocabulary. We began with a new Dialect Fast Stemming Algorithm (DFSA) that utilizes the MSA data. The proposed algorithm achieved 0.85 accuracy measured by the F1 score. Then, the distributional theory-based word embedding method and a new simple, yet effective, feature vector named Wasf-Vec word embedding are tested. Wasf-Vec word representation utilizes a word’s topology features. The difference between Wasf-Vec and distributional theory-based word embedding is that Wasf-Vec captures relations that are not contextually based. The embedding is followed by an analysis of how the dialect words are clustered within other MSA words. The analysis is based on the word semantic relations that are well supported by solid linguistic theories to shed light on the strong and weak word relation representations identified by each embedding method. The analysis is handled by visualizing the feature vector in two-dimensional (2D) space. The feature vectors of the distributional theory-based word embedding method are plotted in 2D space using the t-sne algorithm, while the Wasf-Vec feature vectors are plotted directly in 2D space. A word’s nearest neighbors and the distance-histograms of the plotted words are examined. For validation purpose of the word classification used in this article, the produced classes are employed in Class-based Language Modeling (CBLM). Wasf-Vec CBLM achieved a 7% lower perplexity (pp) than the distributional theory-based word embedding method CBLM. This result is significant when working with low-resource languages.
更多
查看译文
关键词
2D visualizing,Arabic language,Topology,Word embedding,class-based language modeling,dialect,morphology,orthographic,phonology,words classification,words features,words ontology
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要