One Thousand and One Pairs: A "novel" challenge for long-context language models
arXiv (2024)
Abstract
Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test
only surface-level retrieval capabilities, but how well can long-context LLMs
retrieve, synthesize, and reason over information across book-length inputs? We
address this question by creating NoCha, a dataset of 1,001 minimally different
pairs of true and false claims about 67 recently-published English fictional
books, written by human readers of those books. In contrast to existing
long-context benchmarks, our annotators confirm that the largest share of pairs
in NoCha require global reasoning over the entire book to verify. Our
experiments show that while human readers easily perform this task, it is
enormously challenging for all ten long-context LLMs that we evaluate: no
open-weight model performs above random chance (despite their strong
performance on synthetic benchmarks), while GPT-4o achieves the highest
accuracy at 55.8%. Further analysis reveals that (1) on average, models perform
much better on pairs that require only sentence-level retrieval vs. global
reasoning; (2) model-generated explanations for their decisions are often
inaccurate even for correctly-labeled claims; and (3) models perform
substantially worse on speculative fiction books that contain extensive
world-building. The methodology proposed in NoCha allows for the evolution of
the benchmark dataset and the easy analysis of future models.
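The abstract notes that no open-weight model beats random chance, which suggests a pair-level metric: a pair counts as correct only when the model labels both the true and the false claim correctly, so random guessing yields 25% rather than 50%. This scoring rule is an assumption inferred from the abstract, not a stated detail; a minimal sketch:

```python
# Hedged sketch of pair-level accuracy scoring for minimal claim pairs.
# ASSUMPTION: a pair is correct only if BOTH claims are labeled correctly,
# making the random-chance baseline 25% instead of the per-claim 50%.

def pair_accuracy(predictions):
    """predictions: list of (verdict_on_true_claim, verdict_on_false_claim),
    where each verdict is the model's True/False judgment for that claim."""
    if not predictions:
        return 0.0
    correct = sum(
        1
        for verdict_true, verdict_false in predictions
        # true claim should be judged True, false claim judged False
        if verdict_true is True and verdict_false is False
    )
    return correct / len(predictions)
```

For example, a model that judges every claim "True" gets the true claim of each pair right but never the false one, scoring 0.0 under this metric despite 50% per-claim accuracy.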