Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
CoRR (2024)
Abstract
The rapid rise in popularity of Large Language Models (LLMs) with emerging
capabilities has spurred public curiosity to evaluate and compare different
LLMs, leading many researchers to propose their own LLM benchmarks. Noticing
preliminary inadequacies in those benchmarks, we embarked on a study to
critically assess 23 state-of-the-art LLM benchmarks, using our novel unified
evaluation framework through the lenses of people, process, and technology,
under the pillars of functionality and security. Our research uncovered
significant limitations, including biases, difficulties in measuring genuine
reasoning, limited adaptability, implementation inconsistencies, prompt
engineering complexity, insufficient evaluator diversity, and the overlooking
of cultural and ideological norms, in one comprehensive assessment. Our
discussions emphasized
the urgent need for standardized methodologies, regulatory certainties, and
ethical guidelines in light of Artificial Intelligence (AI) advancements,
including advocating for an evolution from static benchmarks to dynamic
behavioral profiling to accurately capture LLMs' complex behaviors and
potential risks. Our study highlighted the necessity for a paradigm shift in
LLM evaluation methodologies, underlining the importance of collaborative
efforts for the development of universally accepted benchmarks and the
enhancement of AI systems' integration into society.