
Towards mining bilingual lexicons and parallel phrases from large-scale monolingual corpora

Shilong Wu, Xu Wang, Qiuyi Ning, Shigui Qiu

2021 International Joint Conference on Neural Networks (IJCNN) (2021)

Abstract
Bilingual lexicons and parallel phrases benefit a range of natural language processing (NLP) tasks. Recent research has shown that high-quality bilingual lexicons can enhance the performance of machine translation, and incorporating them into certain specialized NLP tasks yields clear gains. Bilingual lexicons and parallel phrases can be extracted easily from parallel corpora, but parallel corpora remain scarce compared with monolingual corpora. Monolingual corpora, however, also have the potential to yield a large number of parallel word and phrase pairs. In this paper, we propose two strategies for extracting parallel words and phrases from monolingual corpora. First, we present an indirect mining strategy, Anchored Mining (AM), which injects an anchoring point into each mining procedure to improve accuracy. Second, inspired by how humans learn a foreign language, we propose a novel direct algorithm, Bootstrapping Mining (BM), which mimics the human learning process and aims to learn parallel phrases automatically in a self-iterative way. Additionally, we propose a novel metric, phrase probability-sub item average probability (PP-SAP), which quantitatively evaluates the rationality of each parallel phrase pair extracted from the monolingual corpora. We conduct experiments on large-scale English-Chinese, English-Russian, and English-French monolingual corpora, and the results show that our methods can mine high-quality bilingual lexicons and parallel phrases. We also evaluate our algorithms on low-resource monolingual corpora and obtain good results as well.
Keywords
towards mining bilingual lexicons, parallel phrases, high-quality bilingual lexicons, parallel corpora, parallel word, phrase pairs, parallel words, phrase probability-sub item average probability, extracted parallel phrase pair, English-France monolingual corpora, low-resource monolingual corpora