TaskBench: Benchmarking Large Language Models for Task Automation
CoRR (2023)
Abstract
Recently, the incredible progress of large language models (LLMs) has ignited
the spark of task automation, which decomposes the complex tasks described by
user instructions into sub-tasks, and invokes external tools to execute them,
and plays a central role in autonomous agents. However, the field lacks a
systematic and standardized benchmark to foster the development of LLMs in task
automation. To this end, we introduce TaskBench to evaluate the capability of
LLMs in task automation. Specifically, task automation can be formulated as
three critical stages: task decomposition, tool invocation, and parameter
prediction to fulfill user intent. This complexity makes data collection and
evaluation more challenging compared to common NLP tasks. To generate
high-quality evaluation datasets, we introduce the concept of Tool Graph to
represent the decomposed tasks in user intent, and adopt a back-instruct method
to simulate user instruction and annotations. Furthermore, we propose TaskEval
to evaluate the capability of LLMs from different aspects, including task
decomposition, tool invocation, and parameter prediction. Experimental results
demonstrate that TaskBench can effectively reflect the capability of LLMs in
task automation. Benefiting from the mixture of automated data construction and
human verification, TaskBench achieves high consistency with human
evaluation and can be utilized as a comprehensive and faithful benchmark for
LLM-based autonomous agents.
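
The three stages and the Tool Graph described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class names `ToolNode` and `ToolGraph`, the tool names, and the set-based F1 score are hypothetical and are not taken from TaskBench or TaskEval, whose actual data format and metrics are not specified in this abstract.

```python
from dataclasses import dataclass, field


@dataclass
class ToolNode:
    """One sub-task: a tool to invoke plus its predicted parameters (hypothetical)."""
    tool: str
    params: dict = field(default_factory=dict)


@dataclass
class ToolGraph:
    """A decomposed task: nodes are tool calls, edges are dependencies (hypothetical)."""
    nodes: list
    edges: list  # (source tool, target tool) pairs


def set_f1(predicted, gold):
    """Illustrative set-overlap F1; an assumed metric, not the paper's definition."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up reference annotation for an instruction such as
# "transcribe this clip and translate it into French".
gold = ToolGraph(
    nodes=[
        ToolNode("speech_to_text", {"audio": "clip.wav"}),
        ToolNode("translate", {"target_language": "fr"}),
    ],
    edges=[("speech_to_text", "translate")],
)

# Made-up model output that picks the wrong second tool.
predicted = ToolGraph(
    nodes=[
        ToolNode("speech_to_text", {"audio": "clip.wav"}),
        ToolNode("summarize", {}),
    ],
    edges=[("speech_to_text", "summarize")],
)

# Tool invocation: compare the sets of selected tools.
node_f1 = set_f1([n.tool for n in predicted.nodes], [n.tool for n in gold.nodes])
# Dependencies between sub-tasks: compare the sets of edges.
edge_f1 = set_f1(predicted.edges, gold.edges)

print(node_f1, edge_f1)
```

Scoring nodes and edges separately is one possible way to distinguish a model that selects the right tools from one that also orders them correctly; parameter prediction could be compared per node in the same spirit.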