Sub-word based unsupervised bilingual dictionary induction for Chinese-Uyghur

2022 International Conference on Asian Language Processing (IALP), 2022

Abstract
In this paper, we focus on the task of bilingual dictionary induction for the Chinese-Uyghur language pair. Correlating linguistic information across distant languages usually requires cross-lingual supervision, typically parallel corpora from which seed lexicons are derived, and parallel corpora are expensive. Text data for Uyghur, a low-resource language, are available only in small amounts, and its derivational morphology is rich and complex. In bilingual processing, the first step is to align the most similar units and entity stems, so separating sentences into morpheme sequences is essential for cross-lingual processing tasks. Uyghur words in running text consist of stems joined with several suffixes/prefixes, and the many complex affix combinations produce a large number of derived words. This can easily increase the repetition rate of intentional features in the text, which reduces the efficiency of bilingual dictionary extraction. In this work, we explore resource construction and granularity optimization for low-resource minority languages and learn cross-lingual word embeddings without supervision from parallel data. We propose a Chinese-Uyghur bilingual dictionary extraction method based on neural cross-lingual word embeddings and a multilingual morphological analyzer. Experiments show that the morpheme-sequence-based method significantly outperforms the word-sequence baseline model.
Keywords
bilingual dictionary, unsupervised learning, seed dictionary, morpheme sequence
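
The dictionary extraction step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that source-side and target-side morpheme embeddings have already been mapped into a shared space, and it pairs each source unit with its nearest target unit by cosine similarity. The function names, unit labels, and vectors below are hypothetical placeholders invented for illustration; the paper's actual embedding training, unsupervised mapping, and morphological analyzer are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_dictionary(src_emb, tgt_emb):
    """Pair each source unit with its most similar target unit.

    Assumes both embedding tables already live in one shared space.
    """
    dictionary = {}
    for src_unit, src_vec in src_emb.items():
        best = max(tgt_emb, key=lambda t: cosine(src_vec, tgt_emb[t]))
        dictionary[src_unit] = best
    return dictionary


# Toy 2-d vectors, made up purely for illustration (not data from the paper).
src_emb = {"src_stem_1": [1.0, 0.1], "src_stem_2": [0.1, 1.0]}
tgt_emb = {"tgt_stem_a": [0.9, 0.2], "tgt_stem_b": [0.2, 0.8]}

pairs = induce_dictionary(src_emb, tgt_emb)
```

In this sketch the lookup units would be stems or morphemes rather than full surface words, which is the granularity the abstract argues for: many derived word forms share one stem, so fewer and denser units need to be matched.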