Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
CoRR (2024)
Abstract
Recent claims about the impressive capabilities of large language models
(LLMs) are usually supported by evaluation on open-access benchmarks. Given
the vast size and wide-ranging sources of LLMs' training data, that data could
explicitly or implicitly include test data, making LLMs susceptible to data
contamination. However, the opacity of training data, black-box access to
models, and the rapid growth of synthetic training data make detecting and
mitigating data contamination for LLMs significantly challenging. In this
paper, we propose CDD, which stands for Contamination Detection via output
Distribution for LLMs. CDD requires only sampled texts to detect data
contamination, by identifying the peakedness of the LLM's output distribution.
To mitigate the impact of data contamination in evaluation, we also present
TED: Trustworthy Evaluation via output Distribution, based on correcting the
LLM's output distribution. To facilitate this study, we introduce two
benchmarks, DetCon and ComiEval, for the data contamination detection and
contamination mitigation evaluation tasks. Extensive experimental results show
that CDD achieves average relative improvements of 21.8%-30.2% over other
contamination detection approaches in terms of Accuracy, F1 Score, and AUC,
and can effectively detect contamination caused by variants of the test data.
TED substantially mitigates performance improvements of up to 66.9%
attributable to data contamination across 24 settings and 21 contamination
degrees. In real-world applications, we reveal that ChatGPT exhibits a high
likelihood of data contamination on the HumanEval benchmark.
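
The abstract describes both methods only at the level of the output
distribution, so the following is a minimal Python sketch of one plausible
reading, not the paper's exact formulation: CDD flags contamination when
sampled outputs cluster tightly (in edit distance) around the greedy decoding,
and TED corrects evaluation by discounting samples inside that peak. The
helper names, the `alpha` radius, and the `tau` threshold are all illustrative
assumptions.

```python
# Hedged sketch of the CDD/TED idea as described in the abstract.
# The distance metric, alpha, and tau are assumptions for illustration only.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def peakedness(samples: list[str], greedy: str, alpha: float = 0.05) -> float:
    """Fraction of sampled outputs inside a small edit-distance ball
    around the greedy output; near 1.0 means a sharply peaked distribution."""
    radius = max(1, int(alpha * len(greedy)))
    close = sum(edit_distance(s, greedy) <= radius for s in samples)
    return close / max(1, len(samples))

def cdd_is_contaminated(samples: list[str], greedy: str,
                        tau: float = 0.2) -> bool:
    """CDD-style decision: flag a test item as likely seen in training
    when the output distribution is peaked beyond threshold tau."""
    return peakedness(samples, greedy) >= tau

def ted_correct(samples: list[str], greedy: str,
                alpha: float = 0.05) -> list[str]:
    """TED-style correction (one plausible reading): drop samples inside
    the peak before computing benchmark metrics, so near-verbatim
    reproductions of a memorized test item do not inflate scores."""
    radius = max(1, int(alpha * len(greedy)))
    return [s for s in samples if edit_distance(s, greedy) > radius]
```

For instance, drawing 50 samples per prompt: on an uncontaminated item the
samples stay diverse and `peakedness` remains near 0, while near-verbatim
reproductions of a memorized test item push it toward 1.0 and trip the `tau`
threshold.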