Multi-dimensional data refining strategy for effective fine-tuning LLMs
CoRR (2023)
Abstract
Data is a cornerstone for fine-tuning large language models, yet acquiring
suitable data remains challenging. Challenges include data scarcity,
linguistic diversity, and domain-specific content. This paper presents lessons
learned while crawling and refining data tailored for fine-tuning Vietnamese
language models. Crafting such a dataset, while accounting for linguistic
intricacies and striking a balance between inclusivity and accuracy, demands
meticulous planning. Our paper presents a multidimensional strategy that
includes leveraging existing English-language datasets and developing customized
data-crawling scripts with the assistance of generative AI tools. An LLM
fine-tuned for Vietnamese on the resulting datasets performed well when
generating Vietnamese news articles from prompts. The study offers practical
solutions and guidance for future fine-tuning of models in languages like
Vietnamese.
Keywords
multi-dimensional,fine-tuning