GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations
CoRR (2024)
Abstract
As Large Language Models (LLMs) are integrated into critical real-world
applications, their strategic and logical reasoning abilities are increasingly
crucial. This paper evaluates LLMs' reasoning abilities in competitive
environments through game-theoretic tasks, e.g., board and card games that
require pure logic and strategic reasoning to compete with opponents. We first
propose GTBench, a language-driven environment comprising 10 widely recognized
tasks across a comprehensive game taxonomy: complete versus incomplete
information, dynamic versus static, and probabilistic versus deterministic
scenarios. Then, we investigate two key problems: (1) Characterizing
game-theoretic reasoning of LLMs; (2) LLM-vs-LLM competitions as reasoning
evaluation. We observe that (1) LLMs behave differently across gaming
scenarios; for example, they fail in complete and deterministic games yet
remain competitive in probabilistic gaming scenarios; (2) Open-source
LLMs, e.g., CodeLlama-34b-Instruct, are less competitive than commercial LLMs,
e.g., GPT-4, in complex games. In addition, code-pretraining greatly benefits
strategic reasoning, while advanced reasoning methods such as Chain-of-Thought
(CoT) and Tree-of-Thought (ToT) do not always help. Detailed error profiles are
also provided for a better understanding of LLMs' behavior.
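To make the LLM-vs-LLM evaluation setup concrete, below is a minimal sketch (not the official GTBench code) of a competition loop on a toy complete-information, deterministic game (single-pile Nim). The `Agent` callable stands in for an LLM that reads a textual game state and returns a move; all names here (`Agent`, `play_match`, `random_agent`, `greedy_agent`) are hypothetical illustrations of the protocol described in the abstract.

```python
# Hypothetical sketch of an LLM-vs-LLM game loop; not the GTBench implementation.
import random
from typing import Callable

# An agent maps (textual state description, legal moves) -> chosen move.
Agent = Callable[[str, list[int]], int]

def random_agent(state: str, legal_moves: list[int]) -> int:
    """Baseline opponent: picks a uniformly random legal move."""
    return random.choice(legal_moves)

def greedy_agent(state: str, legal_moves: list[int]) -> int:
    """Plays the winning single-pile Nim strategy: leave a multiple of 4 stones."""
    stones = int(state.split()[0])          # parse the textual state
    for move in legal_moves:
        if (stones - move) % 4 == 0:
            return move
    return legal_moves[0]

def play_match(agent_a: Agent, agent_b: Agent, stones: int = 13) -> str:
    """Run one game; the agent that takes the last stone wins."""
    players = {"A": agent_a, "B": agent_b}
    turn = "A"
    while stones > 0:
        legal = list(range(1, min(3, stones) + 1))
        state = f"{stones} stones remain; you may take 1-3."
        move = players[turn](state, legal)
        if move not in legal:               # an illegal move forfeits the game
            return "B" if turn == "A" else "A"
        stones -= move
        if stones == 0:
            return turn
        turn = "B" if turn == "A" else "A"

if __name__ == "__main__":
    results = [play_match(greedy_agent, random_agent) for _ in range(100)]
    print("greedy (A) win rate:", results.count("A") / len(results))
```

In the benchmark proper, each agent would instead prompt an LLM with the textual state and parse its reply into a move, and win rates over many matches would serve as the reasoning score.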