HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs
Conference of the European Chapter of the Association for Computational Linguistics(2024)
摘要
Hallucinations pose a significant challenge to the reliability and alignment
of Large Language Models (LLMs), limiting their widespread acceptance beyond
chatbot applications. Despite ongoing efforts, hallucinations remain a
prevalent challenge in LLMs. The detection of hallucinations itself is also a
formidable task, frequently requiring manual labeling or constrained
evaluations. This paper introduces an automated scalable framework that
combines benchmarking LLMs' hallucination tendencies with efficient
hallucination detection. We leverage LLMs to generate challenging tasks related
to hypothetical phenomena, subsequently employing them as agents for efficient
hallucination detection. The framework is domain-agnostic, allowing the use of
any language model for benchmark creation or evaluation in any domain. We
introduce the publicly available HypoTermQA Benchmarking Dataset, on which
state-of-the-art models' performance ranged between 3
agents demonstrated a 6
framework provides opportunities to test and improve LLMs. Additionally, it has
the potential to generate benchmarking datasets tailored to specific domains,
such as law, health, and finance.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要