Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level
arXiv (2024)
Abstract
Prominent large language models have exhibited human-level performance in
many domains, even enabling derived agents to simulate human and social
interactions. While prior work has demonstrated the feasibility of grounding
language agents in sandbox simulations or embodied simulators, current social
intelligence benchmarks either remain at the language level or rely on
subjective metrics. In pursuit of a more realistic and objective evaluation,
we introduce the Social Tasks in Sandbox Simulation (STSS) benchmark, which
assesses language agents objectively at the action level by scrutinizing
goal achievement within a multi-agent simulation. Additionally, we sample
conversation scenarios to build a language-level benchmark that provides a
cost-effective preliminary evaluation and aligns with prevailing benchmarks.
To gauge the significance of agent architecture, we implement a target-driven
planning (TDP) module as an adjunct to an existing agent. Our evaluation
shows that the STSS benchmark is challenging for state-of-the-art language
agents. Furthermore, it effectively discriminates between distinct language
agents, suggesting its usefulness as a benchmark for evaluating both language
models and agent architectures.
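The abstract does not specify the evaluation interface, so as a rough illustration of what "objective, action-level" scoring could mean in contrast to subjective language-level ratings, the sketch below checks goal achievement against the final simulation state. This is a minimal sketch under assumed names: `SocialTask`, `run_simulation`, `goal_satisfied`, and `agent.act` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types: the actual STSS benchmark defines its own tasks,
# simulator, and goal predicates; none of these names come from the paper.

@dataclass
class SocialTask:
    """A social task: a scenario plus a goal predicate over the end state."""
    scenario: str
    goal_satisfied: Callable[[dict], bool]  # objective check on final state

def run_simulation(agent, scenario: str, max_steps: int = 50) -> dict:
    """Placeholder multi-agent sandbox rollout.

    A real implementation would step every agent through the sandbox,
    letting them act, converse, and move; here we only sketch the loop.
    """
    state = {"scenario": scenario, "events": []}
    for _ in range(max_steps):
        action = agent.act(state)       # agent chooses its next action
        state["events"].append(action)  # simulator records/applies it
    return state

def evaluate_action_level(agent, tasks: List[SocialTask]) -> float:
    """Action-level metric: the fraction of tasks whose goal is achieved
    in the simulation, judged from the end state rather than from
    subjective ratings of the generated dialogue."""
    successes = 0
    for task in tasks:
        final_state = run_simulation(agent, task.scenario)
        if task.goal_satisfied(final_state):
            successes += 1
    return successes / len(tasks)
```

The key contrast the benchmark draws is that success is decided by a predicate over simulation outcomes, not by a human or model grading the conversation text.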