Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
CoRR (2023)
Abstract
In this paper, we present Math-Shepherd, a process reward model for
mathematical reasoning that assigns a reward score to each step of a
math problem solution. Math-Shepherd is trained on automatically
constructed process-wise supervision data, removing the heavy reliance on
manual annotation that bottlenecks existing work. We explore
the effectiveness of Math-Shepherd in two scenarios: 1) Verification:
Math-Shepherd is utilized for reranking multiple outputs generated by Large
Language Models (LLMs); 2) Reinforcement Learning: Math-Shepherd is
employed to reinforce LLMs with step-by-step Proximal Policy Optimization
(PPO). With Math-Shepherd, a series of open-source LLMs demonstrates
exceptional performance. For instance, the step-by-step PPO with Math-Shepherd
significantly improves the accuracy of Mistral-7B (77.9%→84.1% on GSM8K
and 28.6%→33.0% on MATH). The accuracy can be further enhanced to 89.1%
and 43.5% on GSM8K and MATH with the verification of Math-Shepherd,
respectively. We believe that automatic process supervision holds significant
potential for the future evolution of LLMs.
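
The verification scenario described above (reranking several sampled solutions by their step-level reward scores) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper's code: `toy_scorer` stands in for a trained process reward model, and aggregating step scores by their minimum is an assumption made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

Solution = List[str]  # one string per reasoning step


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one solution-level score.

    Assumption: a solution is only as good as its weakest step, so we
    take the minimum. Other aggregations (e.g. product) are possible.
    """
    if not step_scores:
        return 0.0
    return min(step_scores)


def rerank(
    candidates: Sequence[Solution],
    score_steps: Callable[[Solution], Sequence[float]],
) -> List[Tuple[float, Solution]]:
    """Return (score, solution) pairs sorted from best to worst."""
    scored = [(solution_score(score_steps(c)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_scorer(solution: Solution) -> List[float]:
    """Hypothetical stand-in for a process reward model.

    Gives a low score to any step containing the marker 'WRONG'.
    """
    return [0.1 if "WRONG" in step else 0.9 for step in solution]


candidates = [
    ["2 + 2 = 4", "4 * 3 = 13 WRONG"],
    ["2 + 2 = 4", "4 * 3 = 12"],
]
ranked = rerank(candidates, toy_scorer)
best_score, best_solution = ranked[0]
```

In practice the scorer would be a learned model called once per step; the reranking logic around it stays the same.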