Mitigating Long-Tail Language Representation Collapsing via Cross-Lingual Bootstrapped Unsupervised Fine-Tuning

ECAI 2023 (2023)

Abstract
Large Language Models have shown a strong capability to comprehend natural language and provide reasonable responses. However, previous research has shown that these models perform weakly on low-resource (long-tail) languages. Closing the performance gap between long-tail languages and rich-resource ones, a problem we refer to as long-tail language representation collapsing, remains open. Although some previous works generate pseudo-parallel corpora via auto-regressive generation, this generation process is time-consuming and yields low-quality results, particularly for long-tail languages. In this paper, we propose a Cross-lingual Bootstrapped Unsupervised Fine-tuning Framework (X-BUFF) to mitigate long-tail language representation collapsing. X-BUFF iteratively updates cross-lingual PLMs in a curriculum manner. In each iteration of X-BUFF, we (1) select sentences with complementary semantics from monolingual corpora in long-tail languages; (2) match these selected sentences with semantically equivalent sentences in many other languages to create parallel sentence pairs, which we then merge with previous sentence pairs to build a larger and more difficult bootstrapped parallel queue; and (3) fine-tune the PLMs with the bootstrapped parallel queue. Extensive experiments show that X-BUFF mitigates the long-tail language representation collapsing problem in cross-lingual PLMs and achieves significant improvements over previous baselines on several cross-lingual evaluation benchmarks.
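The three steps above form a growing curriculum over pseudo-parallel pairs. The following is a minimal, illustrative Python sketch of such a bootstrapping loop, not the paper's actual implementation: the stand-in encoder, the similarity-based selection and matching heuristics, and the fine-tuning placeholder are all assumptions made for illustration.

```python
# Hedged sketch of an X-BUFF-style bootstrapping loop (illustrative only).
# encode(), select_complementary(), match_parallel(), and fine_tune() are
# hypothetical helpers; the paper's real procedures may differ substantially.
import numpy as np

def encode(sentences, dim=8):
    """Stand-in encoder: in practice the cross-lingual PLM would embed sentences."""
    rng = np.random.default_rng(abs(hash(tuple(sentences))) % (2**32))
    vecs = rng.normal(size=(len(sentences), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def select_complementary(candidates, selected_vecs, k):
    """Step (1): pick long-tail sentences least similar to those already selected,
    approximating 'complementary semantics'."""
    cand_vecs = encode(candidates)
    if selected_vecs is None:
        scores = np.zeros(len(candidates))
    else:
        scores = (cand_vecs @ selected_vecs.T).max(axis=1)
    order = np.argsort(scores)[:k]  # least similar first
    return [candidates[i] for i in order], cand_vecs[order]

def match_parallel(longtail_sents, rich_pool):
    """Step (2): pair each selected sentence with its nearest neighbour in the
    rich-resource pool to form pseudo-parallel sentence pairs."""
    lt_vecs, rp_vecs = encode(longtail_sents), encode(rich_pool)
    nn = (lt_vecs @ rp_vecs.T).argmax(axis=1)
    return [(s, rich_pool[j]) for s, j in zip(longtail_sents, nn)]

def fine_tune(model_state, parallel_queue):
    """Step (3): placeholder for fine-tuning the cross-lingual PLM on the queue."""
    return model_state + len(parallel_queue)  # stand-in for a parameter update

# Bootstrapped curriculum: each iteration enlarges the parallel queue.
longtail_corpus = ["sentence a", "sentence b", "sentence c", "sentence d"]
rich_corpus = ["sent 1", "sent 2", "sent 3", "sent 4"]
queue, selected_vecs, model_state = [], None, 0
for it in range(2):
    picked, picked_vecs = select_complementary(longtail_corpus, selected_vecs, k=2)
    selected_vecs = picked_vecs if selected_vecs is None else np.vstack([selected_vecs, picked_vecs])
    queue.extend(match_parallel(picked, rich_corpus))
    model_state = fine_tune(model_state, queue)
    print(f"iteration {it}: queue size = {len(queue)}")
```

The sketch only captures the control flow implied by the abstract: the queue monotonically grows across iterations, so later fine-tuning rounds see a larger and presumably harder set of pairs, mirroring the curriculum framing.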