How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library
arXiv (2024)
Abstract
With the rise of Large Language Models (LLMs) in recent years, new
opportunities are emerging, but also new challenges, and contamination is
quickly becoming critical. Business applications and fundraising in AI have
reached a scale at which a few percentage points gained on popular
question-answering benchmarks could translate into tens of millions of
dollars, placing high pressure on model integrity. At the same time, it is
becoming increasingly hard to keep track of the data that LLMs have seen, if
not impossible, since closed-source models like GPT-4 and Claude-3 do not divulge
any information about their training sets. As a result, contamination becomes a
critical issue: LLMs' performance may not be reliable anymore, as the high
performance may be at least partly due to their previous exposure to the data.
This limitation jeopardizes progress across the field of NLP, yet there
remains a lack of methods to efficiently address contamination, and no
clear consensus on its prevention, mitigation, and classification.
In this paper, we survey all recent work on contamination in LLMs and help
the community track contamination levels of LLMs by releasing LLMSanitize, an
open-source Python library implementing major contamination detection
algorithms, available at https://github.com/ntunlp/LLMSanitize.