MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation

arXiv (2021)

Abstract
Massively multilingual machine translation (MT) has shown impressive capabilities, including zero-shot and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá–English (yo–en) language pair, with standardized train–test splits for benchmarking. We provide several neural MT (NMT) benchmarks on this dataset and compare them to the performance of popular pre-trained (massively multilingual) MT models, showing that, in almost all cases, our simple benchmarks outperform the pre-trained MT models. When we use MENYO-20k to fine-tune generic models, we achieve major gains of BLEU +9.9 and +8.6 (en2yo) over Facebook's M2M-100 and Google's multilingual NMT, respectively.
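As an illustration of how a corpus with standardized train–test splits might be consumed, the sketch below loads a parallel en–yo file and carves out deterministic splits. The TSV layout (one tab-separated sentence pair per line), the function names, and the split scheme are assumptions for illustration; the actual MENYO-20k release format may differ.

```python
# Hypothetical sketch: reading a parallel en-yo corpus stored as a TSV
# file with one "english<TAB>yoruba" pair per line, then producing a
# deterministic train/dev/test split. File format and helper names are
# assumptions, not the paper's actual release layout.
import csv


def load_parallel_tsv(path):
    """Return a list of (en, yo) sentence pairs from a tab-separated file."""
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == 2:  # skip malformed or empty lines
                pairs.append((row[0], row[1]))
    return pairs


def split_pairs(pairs, n_dev, n_test):
    """Deterministic split: the last n_test pairs become the test set,
    the n_dev pairs before them the dev set, and the rest the train set."""
    n_train = len(pairs) - n_dev - n_test
    return pairs[:n_train], pairs[n_train:n_train + n_dev], pairs[n_train + n_dev:]
```

A fixed, position-based split like this is one simple way to make benchmark numbers reproducible across papers, since every user evaluates on exactly the same held-out pairs.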
Keywords
machine translation, multi-domain adaptation, corpus, english-yor