ATG: Benchmarking Automated Theorem Generation for Generative Language Models
arXiv (2024)
Abstract
Humans can develop new theorems to explore broader and more complex
mathematical results. While current generative language models (LMs) have
achieved significant improvement in automatically proving theorems, their
ability to generate new or reusable theorems remains under-explored. Without
such new theorems, current LMs struggle to prove harder theorems that lie far
from the given hypotheses, because the search space grows exponentially.
Therefore, this paper proposes an Automated Theorem Generation (ATG) benchmark
that evaluates whether an agent can automatically generate valuable (and
possibly brand new) theorems that are applicable for downstream theorem proving
as reusable knowledge. Specifically, we construct the ATG benchmark by
splitting the Metamath library into three sets (axioms, library, and problem)
according to their proving depth. We then conduct extensive experiments to
investigate whether current LMs can generate theorems in the library set and
thereby benefit proving of the problem theorems. The results demonstrate that
high-quality ATG data
improves models' performance on downstream ATP. However, current LMs still
have room to improve at ATG and to generate more advanced, human-like
theorems. We hope the new ATG challenge can shed light on advanced complex
theorem proving.
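The depth-based split described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the dependency-graph representation, the helper names, and the depth threshold are all assumptions made here for clarity.

```python
# Hypothetical sketch of a proving-depth split.
# Convention assumed here: an axiom has depth 0; a theorem's depth is
# 1 + the maximum depth of the theorems/axioms its proof uses.

def proving_depth(name, premises, cache):
    """Recursively compute proving depth. `premises` maps each statement
    name to the list of statements used in its proof (empty for axioms)."""
    if name in cache:
        return cache[name]
    deps = premises.get(name, [])
    depth = 0 if not deps else 1 + max(
        proving_depth(d, premises, cache) for d in deps
    )
    cache[name] = depth
    return depth

def split_by_depth(premises, library_max_depth):
    """Partition statements into axioms (depth 0), library (shallow
    theorems), and problem (deep theorems). The threshold is arbitrary."""
    cache = {}
    axioms, library, problem = [], [], []
    for name in premises:
        d = proving_depth(name, premises, cache)
        if d == 0:
            axioms.append(name)
        elif d <= library_max_depth:
            library.append(name)
        else:
            problem.append(name)
    return axioms, library, problem

# Toy dependency graph: ax1/ax2 are axioms; thm1 uses both; thm2 uses thm1.
deps = {"ax1": [], "ax2": [], "thm1": ["ax1", "ax2"], "thm2": ["thm1", "ax2"]}
print(split_by_depth(deps, 1))
```

With the toy graph, thm1 has depth 1 and lands in the library set, while thm2 (depth 2) becomes a problem theorem; in the benchmark, an agent would be asked to generate library-like theorems that make the problem theorems easier to prove.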