The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
arXiv (2023)
Abstract
This study investigates the consequences of training language models on
synthetic data generated by their predecessors, an increasingly prevalent
practice given the prominence of powerful generative models. Diverging from the
usual emphasis on performance metrics, we focus on the impact of this training
methodology on linguistic diversity, especially when conducted recursively over
time. To assess this, we adapt and develop a set of novel metrics targeting
lexical, syntactic, and semantic diversity, applying them in recursive
finetuning experiments across various natural language generation tasks in
English. Our findings reveal a consistent decrease in the diversity of the
model outputs through successive iterations, a decline that is especially
pronounced for tasks demanding high levels of creativity. This trend underscores the potential risks
of training language models on synthetic text, particularly concerning the
preservation of linguistic richness. Our study highlights the need for careful
consideration of the long-term effects of such training approaches on the
linguistic capabilities of language models.
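
The abstract refers to metrics for lexical, syntactic, and semantic diversity without specifying them. As a point of reference only, the sketch below shows one widely used lexical diversity measure (distinct-n, the ratio of unique n-grams to total n-grams over a set of generations); it is an illustrative assumption, not the paper's own metric, and the example outputs are invented.

```python
# Illustrative sketch of a common lexical diversity measure (distinct-n).
# Assumption: this is NOT the paper's metric, just a standard baseline measure.

def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a set of generated texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Hypothetical usage: compare outputs from an early vs. a later finetuning iteration.
outputs_iter0 = ["the cat sat on the mat", "a dog ran through the park"]
outputs_iter3 = ["the cat sat on the mat", "the cat sat on the rug"]
print(distinct_n(outputs_iter0), distinct_n(outputs_iter3))
```

Under this measure, a drop in the score across iterations would reflect the kind of lexical narrowing the abstract describes, though the paper's actual evaluation may differ.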