Chinese-Uyghur Bilingual Lexicon Induction Based on Morpheme Sequence and Weak Supervision

Anwar Aysa, Mijit Ablimit, Hankiz Yilahun, Askar Hamdulla

2022 3rd International Conference on Pattern Recognition and Machine Learning (PRML) (2022)

Abstract
The bilingual dictionary is a vital data resource for machine translation and cross-language information retrieval research. Uyghur has rich derivational morphology: words are formed by attaching several suffixes to a stem, so a large number of new word forms can be generated. This increases the repetition rate of intentional features in the text and reduces the efficiency of bilingual dictionary extraction. To address the poor alignment of Chinese-Uyghur cross-lingual word embeddings caused by significant morphological differences, this paper proposes a multilingual morphological analyzer based on morpheme sequences, combined with neural-network cross-lingual word embedding mapping, and applies it to the Chinese-Uyghur bilingual dictionary extraction task. Robust morpheme segmentation and stemming of the bilingual text data are used to obtain meaningful word semantic features. A small Chinese-Uyghur parallel seed dictionary serves as a weak supervision signal to map the multilingual word vectors or morpheme vectors, respectively, into a unified vector space. Bilingual dictionaries are then extracted automatically through associative alignment, using two bilingual retrieval methods: nearest-neighbor retrieval and cross-domain similarity local scaling (CSLS). Experimental results show that, on the Chinese-Uyghur dictionary induction task, the morpheme sequence-based method significantly improves the accuracy of dictionary alignment compared to the word-based model. The proposed method improves the accuracy of bilingual word alignment and is effective for morphologically derivational languages.
Keywords
bilingual dictionary, inter-word relationship matrix, seed dictionary, morpheme sequence
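
The mapping and retrieval steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract mentions a neural-network mapping, whereas this sketch swaps in a closed-form orthogonal (Procrustes) mapping learned from seed pairs, followed by CSLS retrieval. The function names (`learn_mapping`, `csls_scores`, `induce_lexicon`), the neighborhood size `k`, and the toy random vectors are all assumptions made for illustration. Each row may stand for a word vector or a morpheme (stem) vector.

```python
import numpy as np


def normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def learn_mapping(src_seed, tgt_seed):
    """Orthogonal W minimizing ||src_seed @ W - tgt_seed||_F (Procrustes).

    Row i of src_seed and row i of tgt_seed are a seed-dictionary pair.
    This is a stand-in for the paper's mapping, not the paper's method.
    """
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt


def csls_scores(src, tgt, k=2):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y).

    r_T(x): mean cosine similarity of x to its k nearest target vectors.
    r_S(y): mean cosine similarity of y to its k nearest source vectors.
    """
    sims = normalize(src) @ normalize(tgt).T
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)
    return 2 * sims - r_src[:, None] - r_tgt[None, :]


def induce_lexicon(src, tgt, w, k=2):
    """For each source row, return the index of the best target row under CSLS."""
    return csls_scores(src @ w, tgt, k).argmax(axis=1)


# Toy data (hypothetical): target vectors are random, and source vectors
# are the same vectors in a rotated space, so row i translates to row i.
rng = np.random.default_rng(0)
dim, vocab, n_seed = 4, 6, 5
tgt_vecs = rng.normal(size=(vocab, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_vecs = tgt_vecs @ rotation.T

# Only the first n_seed pairs act as the weak supervision signal.
w = learn_mapping(src_vecs[:n_seed], tgt_vecs[:n_seed])
pred = induce_lexicon(src_vecs, tgt_vecs, w)
```

In this toy setup `pred[i]` is the index of the target entry retrieved for source entry `i`, including the entry that was not in the seed set. With real embeddings the two spaces are only approximately related by a rotation, so retrieval accuracy depends on embedding quality, which is where the morpheme-level segmentation described in the abstract is intended to help.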