Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop
CoRR (2023)
Abstract
Large language models (LLMs) have become the state of the art on many benchmarks,
and conversational LLM applications such as ChatGPT are now widely used by the
public. These LLMs can be used to generate large amounts of content, which is
posted to various platforms on the internet. Because LLMs are trained on datasets
usually collected from the internet, this LLM-generated content may be used to
train the next generation of LLMs. As a result, a self-consuming training loop
emerges in which new LLM generations are trained on the output of previous
generations. We empirically study this self-consuming training loop using a novel
dataset that lets us analytically and accurately measure the quality and diversity
of generated outputs. We find that the self-consuming training loop initially
improves both quality and diversity. However, after a few generations the output
inevitably degenerates in diversity, and the rate of degeneration depends on the
proportion of real and generated data.
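The loop the abstract describes can be made concrete with a small toy simulation. The sketch below is not the paper's experimental setup: `train_model` and `generate_samples` are hypothetical stand-ins for LLM fine-tuning and sampling, and the distinct-sample ratio is only a crude stand-in for the diversity measures the paper uses. It is meant solely to show how each generation's training corpus mixes a fixed proportion of real data with synthetic data from the previous generation.

```python
import random

# Hypothetical stand-ins for LLM fine-tuning and sampling; the paper does not
# prescribe these, they exist only to make the loop structure runnable.
def train_model(corpus: list[str]) -> list[str]:
    """'Train' a toy model by simply memorizing its training corpus."""
    return list(corpus)

def generate_samples(model: list[str], n: int) -> list[str]:
    """'Sample' from the toy model by drawing, with replacement, from what it memorized."""
    return [random.choice(model) for _ in range(n)]

def self_consuming_loop(real_data: list[str], generations: int,
                        real_fraction: float, corpus_size: int) -> None:
    """Simulate the self-consuming training loop: each generation is trained on
    a mix of real data and the previous generation's output, controlled by
    `real_fraction` (the proportion of real data in every training corpus)."""
    corpus = list(real_data)
    for g in range(generations):
        model = train_model(corpus)
        n_real = int(real_fraction * corpus_size)
        n_generated = corpus_size - n_real
        generated = generate_samples(model, n_generated)
        # Next generation's corpus: fresh real data plus synthetic data
        # produced by the current generation.
        corpus = random.sample(real_data, min(n_real, len(real_data))) + generated
        # Crude diversity proxy: fraction of distinct samples in the corpus.
        diversity = len(set(corpus)) / len(corpus)
        print(f"generation {g}: distinct-sample ratio = {diversity:.2f}")

if __name__ == "__main__":
    random.seed(0)
    real = [f"document-{i}" for i in range(1000)]
    self_consuming_loop(real, generations=10, real_fraction=0.1, corpus_size=1000)
```

Even in this toy setting, repeated sampling with replacement causes the distinct-sample ratio to shrink over generations, while a larger `real_fraction` slows that decline, mirroring the abstract's claim that the rate of degeneration depends on the proportion of real and generated data.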