Can We Verify Step by Step for Incorrect Answer Detection?
CoRR (2024)
Abstract
Chain-of-Thought (CoT) prompting has marked a significant advancement in
enhancing the reasoning capabilities of large language models (LLMs). Previous
studies have developed various extensions of CoT, which focus primarily on
enhancing end-task performance. In addition, there has been research on
assessing the quality of reasoning chains in CoT. This raises an intriguing
question: Is it possible to predict the accuracy of LLM outputs by scrutinizing
the reasoning chains they generate? To answer this research question, we
introduce a benchmark, R2PE, designed specifically to explore the relationship
between reasoning chains and performance in various reasoning tasks spanning
five different domains. The benchmark measures whether an LLM's final output is
incorrect, judging from its intermediate reasoning steps. To make full use of
information in multiple reasoning chains, we propose the process discernibility
score (PDS) framework that beats the answer-checking baseline by a large
margin. Concretely, PDS yields an average 5.1% increase in the F1 score across
all 45 subsets within R2PE. We further demonstrate the efficacy of PDS in
improving open-domain QA accuracy. Data and code are available at
https://github.com/XinXU-USTC/R2PE.
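
The abstract does not give the PDS formula, but the core idea of judging an answer from multiple sampled reasoning chains can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the paper's actual PDS: it uses agreement among sampled chains (in the spirit of self-consistency) as a proxy signal for detecting likely-incorrect answers. All identifiers (`ChainResult`, `discernibility_score`, `AGREEMENT_THRESHOLD`) are hypothetical and introduced only for this example.

```python
# A minimal sketch, NOT the paper's actual PDS: it scores answer reliability
# from multiple sampled reasoning chains, using agreement among the chains'
# final answers (in the spirit of self-consistency) as the signal.
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class ChainResult:
    steps: List[str]  # intermediate reasoning steps of one sampled chain
    answer: str       # final answer extracted from that chain


def discernibility_score(chains: List[ChainResult]) -> float:
    """Return the fraction of chains agreeing with the majority answer.

    A low score means the sampled chains diverge, which this sketch treats
    as evidence that the majority answer may be incorrect.
    """
    votes = Counter(chain.answer for chain in chains)
    _, majority_count = votes.most_common(1)[0]
    return majority_count / len(chains)


# Hypothetical cutoff; in practice it would be tuned on a validation set.
AGREEMENT_THRESHOLD = 0.6


def predict_incorrect(chains: List[ChainResult]) -> bool:
    """Flag the majority answer as likely incorrect when agreement is low."""
    return discernibility_score(chains) < AGREEMENT_THRESHOLD


if __name__ == "__main__":
    sampled = [
        ChainResult(steps=["2 + 2 = 4", "4 * 3 = 12"], answer="12"),
        ChainResult(steps=["2 + 2 = 4", "4 * 3 = 12"], answer="12"),
        ChainResult(steps=["2 + 2 = 5", "5 * 3 = 15"], answer="15"),
    ]
    print(discernibility_score(sampled))  # 0.666... (2 of 3 chains agree)
    print(predict_incorrect(sampled))     # False: agreement 0.67 >= 0.6
```

Note that this proxy looks only at final answers; per the abstract, the actual PDS is designed to make full use of the information in the reasoning chains themselves, which this agreement-only sketch deliberately omits.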