Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets
arXiv (2024)
Abstract
Advancements in Large Language Models (LLMs) have significantly enhanced
instruction-following capabilities. However, most Instruction Fine-Tuning (IFT)
datasets are predominantly in English, limiting model performance in other
languages. Traditional methods for creating multilingual IFT datasets, such as
translating existing English IFT datasets or converting existing NLP datasets
into IFT datasets via templating, struggle to capture linguistic nuances and
ensure prompt (instruction) diversity. To address this issue, we propose a
novel method for collecting multilingual IFT datasets that preserves linguistic
naturalness and ensures prompt diversity. This approach leverages
English-focused LLMs, monolingual corpora, and a scoring function to create
high-quality, diversified IFT datasets in multiple languages. Experiments
demonstrate that LLMs finetuned using these IFT datasets show notable
improvements in both generative and discriminative tasks, indicating enhanced
language comprehension by LLMs in non-English contexts. Specifically, on the
multilingual summarization task, LLMs using our IFT dataset achieved 17.57%
and 15.23% improvements over LLMs fine-tuned with translation-based and
template-based datasets, respectively.