MULTI: Multimodal Understanding Leaderboard with Text and Images
CoRR (2024)
Abstract
Rapid progress in multimodal large language models (MLLMs) highlights the
need to introduce challenging yet realistic benchmarks to the academic
community, while existing benchmarks primarily focus on understanding simple
natural images and short context. In this paper, we present MULTI as a
cutting-edge benchmark for evaluating MLLMs on understanding complex tables and
images, and reasoning with long context. MULTI provides multimodal inputs and
requires responses that are either precise or open-ended, reflecting real-life
examination styles. MULTI includes over 18,000 questions and challenges MLLMs
with a variety of tasks, ranging from formula derivation to image detail
analysis and cross-modality reasoning. We also introduce MULTI-Elite, a hard
subset of 500 carefully selected questions, and MULTI-Extend, which provides
more than 4,500 external knowledge context pieces.
Our evaluation indicates significant
potential for MLLM advancement, with GPT-4V achieving a 63.7% accuracy rate on
MULTI, in contrast to other MLLMs scoring between 28.5% and 55.3%. MULTI serves
not only as a robust evaluation platform but also paves the way for the
development of expert-level AI.