Chain of Thoughtlessness? An Analysis of CoT in Planning
CoRR (2024)
Abstract
Large language model (LLM) performance on reasoning problems typically does
not generalize out of distribution. Previous work has claimed that this can be
mitigated with chain of thought (CoT) prompting, a method of demonstrating
solution procedures, with the intuition that it is possible to teach an LLM an
algorithm for solving the problem in context. This paper presents a case study of chain of
thought on problems from Blocksworld, a classical planning domain, and examines
the performance of two state-of-the-art LLMs across two axes: generality of
examples given in prompt, and complexity of problems queried with each prompt.
Although our problems are very simple, we find meaningful performance
improvements from chain of thought prompts only when those prompts are
exceedingly specific to their problem class, and those improvements quickly
deteriorate as the size n of the query-specified stack grows past the size of
the stacks shown in the examples. We also create scalable variants of three domains
commonly studied in previous CoT papers and demonstrate the existence of
similar failure modes. Our results hint that, contrary to previous claims in
the literature, CoT's performance improvements do not stem from the model
learning general algorithmic procedures via demonstrations but depend on
carefully engineering highly problem-specific prompts. This spotlights
drawbacks of chain of thought, especially the sharp tradeoff between possible
performance gains and the amount of human labor necessary to generate examples
with correct reasoning traces.