GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving
CoRR (2024)
Abstract
Recent advancements in Large Language Models (LLMs) and Multi-Modal Models
(MMs) have demonstrated their remarkable capabilities in problem-solving. Yet,
their proficiency in tackling geometry math problems, which necessitates an
integrated understanding of both textual and visual information, has not been
thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark,
a comprehensive collection that includes a main subset of 2000 problems, a 750
problem subset focusing on backward reasoning, an augmented subset of 2000
problems, and a hard subset of 300 problems. This benchmark facilitates a
deeper investigation into the performance of LLMs and MMs on solving geometry
math problems. Our evaluation of ten LLMs and MMs across these varied subsets
reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on
the main subset but only 6.00% accuracy on the hard subset. This
highlights the critical need for testing models against datasets on which they
have not been pre-trained. Additionally, our findings indicate that GPT-series
models perform more effectively on problems they have rephrased, suggesting a
promising method for enhancing model capabilities.