DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks
arXiv (2023)
Abstract
Large language models (LLMs) have achieved remarkable performance on various
evaluation benchmarks. However, concerns have been raised about potential data
contamination in their vast training corpora. Moreover, the static nature and
fixed complexity of current benchmarks may inadequately gauge the advancing
capabilities of LLMs. In this paper, we introduce DyVal, a general and
flexible protocol for the dynamic evaluation of LLMs. Based on our framework,
we build graph-informed DyVal by leveraging the structural advantage of
directed acyclic graphs to dynamically generate evaluation samples with
controllable complexities. DyVal generates challenging evaluation sets for
reasoning tasks, including mathematics, logical reasoning, and algorithmic
problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo
and GPT-4. Experiments show that LLMs perform worse on DyVal-generated
evaluation samples of varying complexity, highlighting the significance of
dynamic evaluation. We also analyze failure cases and the results of different
prompting methods. Moreover, DyVal-generated samples are not only evaluation
sets but also helpful data for fine-tuning, improving the performance of LLMs
on existing benchmarks. We hope that DyVal can shed light on future research
on the evaluation of LLMs. Code is available at:
https://github.com/microsoft/promptbench.
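To make the graph-informed idea concrete, here is a minimal sketch of how a DAG-based generator might dynamically produce arithmetic evaluation samples with a controllable complexity knob (depth). All names and the tree-shaped structure are illustrative assumptions, not the paper's actual implementation; see the linked repository for the real code.

```python
import random

# Hypothetical DyVal-style generator (an assumption, not the paper's code):
# leaves hold random integers, internal nodes hold arithmetic operators,
# and the `depth` parameter controls the complexity of each sample.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def gen_sample(depth, rng):
    """Recursively build an arithmetic expression tree of the given depth.

    Returns (description, value): a textual problem statement plus its
    ground-truth answer, so every freshly sampled tree is a brand-new
    evaluation item with a known label (mitigating data contamination).
    """
    if depth == 0:
        v = rng.randint(1, 9)
        return str(v), v
    op = rng.choice(sorted(OPS))
    left_desc, left_val = gen_sample(depth - 1, rng)
    right_desc, right_val = gen_sample(depth - 1, rng)
    return f"({left_desc} {op} {right_desc})", OPS[op](left_val, right_val)

if __name__ == "__main__":
    rng = random.Random(0)
    expr, answer = gen_sample(depth=3, rng=rng)
    print(f"Compute: {expr}  (ground truth: {answer})")
```

Because the generator samples a fresh structure on every call, the evaluation set can be regenerated at will and its difficulty tuned via `depth`, which is the core property the abstract attributes to dynamic evaluation.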