Mitigating Long-Tail Language Representation Collapsing via Cross-Lingual Bootstrapped Unsupervised Fine-Tuning
ECAI 2023 (2023)
Abstract
Large Language Models have shown a strong capability to comprehend natural language and provide reasonable responses. However, previous research has shown that these models perform poorly on low-resource (long-tail) languages. Mitigating the performance gap between long-tail languages and rich-resource ones, which is referred to as long-tail language representation collapsing, remains an open problem. Although some previous works generate pseudo-parallel corpora via auto-regressive generation, this generation process is time-consuming and yields low-quality output, particularly for long-tail languages. In this paper, we propose a Cross-lingual Bootstrapped Unsupervised Fine-tuning Framework (X-BUFF) to mitigate long-tail language representation collapsing. X-BUFF iteratively updates cross-lingual PLMs in a curriculum manner. In each iteration of X-BUFF, we (1) select sentences with complementary semantics from monolingual corpora in long-tail languages, (2) match these selected sentences with semantically equivalent sentences in many other languages to create parallel sentence pairs, which we then merge with previous pairs to build a larger and more difficult bootstrapped parallel queue, and (3) fine-tune the PLMs on the bootstrapped parallel queue. Extensive experiments show that X-BUFF mitigates the long-tail language representation collapsing problem in cross-lingual PLMs and achieves significant improvements over previous baselines on several cross-lingual evaluation benchmarks.
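The abstract describes a three-step iterative loop. Below is a minimal, runnable Python sketch of that control flow under stated assumptions: the toy `embed` encoder, the farthest-point selection heuristic, the similarity threshold, and all function names are illustrative placeholders, not the authors' implementation.

```python
import hashlib
import math


def embed(sentence: str) -> list[float]:
    """Toy stand-in for a cross-lingual PLM sentence encoder (hypothetical)."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:8]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def select_complementary(corpus: list[str], k: int) -> list[str]:
    """Step (1), assumed heuristic: greedy farthest-point selection, so the
    chosen long-tail sentences cover complementary semantics rather than
    near-duplicates."""
    pool = list(corpus)
    if not pool:
        return []
    chosen = [pool.pop(0)]
    while pool and len(chosen) < k:
        # Pick the sentence least similar to anything already chosen.
        nxt = min(pool, key=lambda s: max(cosine(embed(s), embed(c)) for c in chosen))
        chosen.append(nxt)
        pool.remove(nxt)
    return chosen


def mine_pairs(sentences: list[str], other_corpora: list[str],
               threshold: float = 0.9) -> list[tuple[str, str]]:
    """Step (2), assumed mining strategy: pair each selected sentence with its
    nearest neighbor across other-language corpora, keeping confident matches."""
    pairs = []
    for src in sentences:
        e_src = embed(src)
        best = max(other_corpora, key=lambda t: cosine(e_src, embed(t)))
        if cosine(e_src, embed(best)) >= threshold:
            pairs.append((src, best))
    return pairs


def x_buff(longtail_corpus: list[str], other_corpora: list[str],
           fine_tune, iterations: int = 3, k: int = 100) -> list[tuple[str, str]]:
    """Curriculum loop: the bootstrapped parallel queue grows every round, so
    step (3) fine-tunes on a larger, harder pair set at each iteration."""
    remaining = list(longtail_corpus)
    queue: list[tuple[str, str]] = []
    for _ in range(iterations):
        batch = select_complementary(remaining, k)      # (1) select
        for s in batch:
            remaining.remove(s)                         # don't reselect
        queue.extend(mine_pairs(batch, other_corpora))  # (2) match + merge
        fine_tune(queue)                                # (3) fine-tune PLM
    return queue
```

In the paper's setting, `embed`, the similarity threshold, and the `fine_tune` callback would be replaced by the cross-lingual PLM's own encoder and training routine; the toy pieces here only make the bootstrapped-queue control flow concrete.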