Beyond Reference-Based Metrics: Analyzing Behaviors of Open LLMs on Data-to-Text Generation
CoRR(2024)
摘要
We investigate to which extent open large language models (LLMs) can generate
coherent and relevant text from structured data. To prevent bias from
benchmarks leaked into LLM training data, we collect Quintd-1: an ad-hoc
benchmark for five data-to-text (D2T) generation tasks, consisting of
structured data records in standard formats gathered from public APIs. We
leverage reference-free evaluation metrics and LLMs' in-context learning
capabilities, allowing us to test the models with no human-written references.
Our evaluation focuses on annotating semantic accuracy errors on token-level,
combining human annotators and a metric based on GPT-4. Our systematic
examination of the models' behavior across domains and tasks suggests that
state-of-the-art open LLMs with 7B parameters can generate fluent and coherent
text from various standard data formats in zero-shot settings. However, we also
show that semantic accuracy of the outputs remains a major issue: on our
benchmark, 80
human annotators (91
are available at https://d2t-llm.github.io.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要