MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
CVPR 2024(2023)
摘要
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and
deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal
questions from college exams, quizzes, and textbooks, covering six core
disciplines: Art & Design, Business, Science, Health & Medicine, Humanities &
Social Science, and Tech & Engineering. These questions span 30 subjects and
183 subfields, comprising 30 highly heterogeneous image types, such as charts,
diagrams, maps, tables, music sheets, and chemical structures. Unlike existing
benchmarks, MMMU focuses on advanced perception and reasoning with
domain-specific knowledge, challenging models to perform tasks akin to those
faced by experts. Our evaluation of 14 open-source LMMs and the proprietary
GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the
advanced GPT-4V only achieves a 56% accuracy, indicating significant room for
improvement. We believe MMMU will stimulate the community to build
next-generation multimodal foundation models towards expert artificial general
intelligence.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要