On the Evaluation of Machine-Generated Reports
arXiv (2024)
Abstract
Large Language Models (LLMs) have enabled new ways to satisfy information
needs. Although great strides have been made in applying them to settings like
document ranking and short-form text generation, they still struggle to compose
complete, accurate, and verifiable long-form reports. Reports with these
qualities are necessary to satisfy the complex, nuanced, or multi-faceted
information needs of users. In this perspective paper, we draw together
opinions from industry and academia, and from a variety of related research
areas, to present our vision for automatic report generation, and – critically
– a flexible framework by which such reports can be evaluated. In contrast
with other summarization tasks, automatic report generation starts with a
detailed description of an information need, stating the necessary background,
requirements, and scope of the report. Further, the generated reports should be
complete, accurate, and verifiable. These qualities, which are desirable – if
not required – in many analytic report-writing settings, require rethinking
how to build and evaluate systems that exhibit them. To foster new
efforts in building these systems, we present an evaluation framework that
draws on ideas found in various evaluations. To test completeness and accuracy,
the framework uses nuggets of information, expressed as questions and answers,
that need to be part of any high-quality generated report. Additionally,
evaluation of citations that map claims made in the report to their source
documents ensures verifiability.
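
To make the two checks concrete, the following is a minimal sketch of nugget-based scoring under stated assumptions; it is not the framework's actual implementation. The names Nugget, Sentence, nugget_recall, and citation_support are hypothetical, the toy data is purely illustrative, and case-insensitive substring matching stands in for whatever matching procedure (human or automatic) a real evaluation would use.

```python
from dataclasses import dataclass, field


@dataclass
class Nugget:
    """A unit of required information, expressed as a question and its answer."""
    question: str
    answer: str


@dataclass
class Sentence:
    """One report sentence plus the IDs of the documents it cites."""
    text: str
    cited_doc_ids: list[str] = field(default_factory=list)


def contains(text: str, phrase: str) -> bool:
    # Assumption: naive case-insensitive substring match as a stand-in
    # for a real nugget-matching step.
    return phrase.lower() in text.lower()


def nugget_recall(report: list[Sentence], nuggets: list[Nugget]) -> float:
    """Fraction of nuggets whose answer appears somewhere in the report."""
    if not nuggets:
        return 0.0
    covered = sum(
        1 for n in nuggets if any(contains(s.text, n.answer) for s in report)
    )
    return covered / len(nuggets)


def citation_support(
    report: list[Sentence], nuggets: list[Nugget], corpus: dict[str, str]
) -> float:
    """Fraction of nugget-bearing sentences backed by at least one cited document."""
    checked = 0
    supported = 0
    for s in report:
        for n in nuggets:
            if contains(s.text, n.answer):
                checked += 1
                # A claim counts as verifiable if any cited document
                # also contains the nugget answer.
                if any(contains(corpus.get(d, ""), n.answer) for d in s.cited_doc_ids):
                    supported += 1
    return supported / checked if checked else 0.0


# Toy data, for illustration only.
nuggets = [
    Nugget("When did the bridge open?", "1937"),
    Nugget("Who was the chief engineer?", "Joseph Strauss"),
]
corpus = {
    "d1": "The bridge opened to traffic in 1937.",
    "d2": "An unrelated passage about local weather.",
}
report = [
    Sentence("The bridge opened in 1937.", ["d1"]),
    Sentence("Its chief engineer was Joseph Strauss.", ["d2"]),
]

recall = nugget_recall(report, nuggets)
support = citation_support(report, nuggets, corpus)
print(f"nugget recall: {recall:.2f}, citation support: {support:.2f}")
```

In this toy case the report covers both nuggets, but the second sentence cites a document that does not contain its claim, so the report is complete yet only half of its nugget-bearing claims are verifiable. Keeping the two scores separate reflects the distinction the abstract draws between completeness and accuracy on one hand and verifiability on the other.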