VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation
arxiv(2024)
摘要
Existing metrics for evaluating the factuality of long-form text, such as
FACTSCORE (Min et al., 2023) and SAFE (Wei et al., 2024), decompose an input
text into "atomic claims" and verify each against a knowledge base like
Wikipedia. These metrics are not suitable for most generation tasks because
they assume that every claim is verifiable (i.e., can plausibly be proven true
or false). We address this issue with VERISCORE, a metric for diverse long-form
generation tasks that contain both verifiable and unverifiable content.
VERISCORE can be effectively implemented with either closed or fine-tuned
open-weight language models, and human evaluation confirms that VERISCORE's
extracted claims are more sensible than those from competing methods across
eight different long-form tasks. We use VERISCORE to evaluate generations from
16 different models across multiple long-form tasks and find that while GPT-4o
is the best-performing model overall, open-weight models such as Mixtral-8x22
are closing the gap. We show that an LM's VERISCORE on one task (e.g.,
biography generation) does not necessarily correlate to its VERISCORE on a
different task (e.g., long-form QA), highlighting the need for expanding
factuality evaluation across tasks with varying fact density.
更多查看译文
AI 理解论文
溯源树
样例
![](https://originalfileserver.aminer.cn/sys/aminer/pubs/mrt_preview.jpeg)
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要