Marathon: A Race Through the Realm of Long Context with Large Language Models
arXiv (2023)
Abstract
With the advancement of large language models (LLMs) and the expansion of
their context windows, existing long-context benchmarks fall short in
effectively evaluating the models' comprehension and reasoning abilities in
extended texts. Moreover, conventional benchmarks relying on F1 metrics often
inaccurately score responses: they may undervalue correct answers that differ
from the reference responses and overvalue incorrect ones that resemble the
reference texts. In response to these limitations, we introduce Marathon, a
novel evaluation benchmark that adopts a multiple-choice question format. It is
specifically designed to overcome the constraints of previous benchmarks and
provide a rapid, precise, and unbiased appraisal of the long-context
comprehension skills of large language models. We conducted comprehensive
evaluations on the Marathon benchmark with a range of state-of-the-art LLMs and
assessed the effectiveness of various optimization strategies tailored for
long-context generation. We anticipate that the Marathon benchmark and its
associated leaderboard will enable a more precise and equitable evaluation of
LLMs' capabilities in understanding and reasoning over extended contexts.
Marathon is available at https://github.com/Hambaobao/Marathon.
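
The scoring problem described above can be illustrated with a minimal sketch. It assumes a SQuAD-style token-overlap F1, which may differ from the F1 variant a given benchmark uses, and it compares that score with plain multiple-choice accuracy. The function names and example strings are hypothetical and are not taken from the Marathon repository.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-words F1 between a free-form answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def choice_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of multiple-choice questions where the chosen option is correct."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical example answers (not from any benchmark).
reference = "the meeting was moved to friday"
correct_paraphrase = "it got rescheduled for the end of the week"
wrong_near_copy = "the meeting was moved to monday"

f1_correct = token_f1(correct_paraphrase, reference)
f1_wrong = token_f1(wrong_near_copy, reference)
print(f"F1 of correct paraphrase: {f1_correct:.2f}")
print(f"F1 of wrong near-copy:    {f1_wrong:.2f}")

# With option labels, scoring reduces to exact comparison of the chosen label.
print(f"Accuracy: {choice_accuracy(['B', 'A', 'D'], ['B', 'C', 'D']):.2f}")
```

In this sketch the correct paraphrase scores lower than the wrong answer that copies most of the reference wording, which is the failure mode the abstract attributes to F1-based benchmarks. Under a multiple-choice format, the score depends only on whether the selected option matches the gold option.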