MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
CoRR(2024)
摘要
We introduce MuirBench, a comprehensive benchmark that focuses on robust
multi-image understanding capabilities of multimodal LLMs. MuirBench consists
of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that
involve 10 categories of multi-image relations (e.g., multiview, temporal
relations). Comprising 11,264 images and 2,600 multiple-choice questions,
MuirBench is created in a pairwise manner, where each standard instance is
paired with an unanswerable variant that has minimal semantic differences, in
order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our
results reveal that even the best-performing models like GPT-4o and Gemini Pro
find it challenging to solve MuirBench, achieving 68.0
Open-source multimodal LLMs trained on single images can hardly generalize to
multi-image questions, hovering below 33.3
highlight the importance of MuirBench in encouraging the community to develop
multimodal LLMs that can look beyond a single image, suggesting potential
pathways for future improvements.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要