MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
arXiv (2024)
Abstract
Large Vision-Language Models (LVLMs) have made significant strides in
general-purpose multimodal applications such as visual dialogue and embodied
navigation. However, existing multimodal evaluation benchmarks cover only a
limited number of multimodal tasks that test rudimentary capabilities, and thus
fall short of tracking LVLM development. In this study, we present MMT-Bench, a comprehensive
benchmark designed to assess LVLMs across massive multimodal tasks requiring
expert knowledge and deliberate visual recognition, localization, reasoning,
and planning. MMT-Bench comprises 31,325 meticulously curated multi-choice
visual questions from various multimodal scenarios such as vehicle driving and
embodied navigation, covering 32 core meta-tasks and 162 subtasks in
multimodal understanding. Due to its extensive task coverage, MMT-Bench enables
the evaluation of LVLMs using a task map, facilitating the discovery of in- and
out-of-domain tasks. Evaluation results for 30 LVLMs, including the
proprietary GPT-4V and GeminiProVision and the open-source InternVL-Chat, underscore
the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench
will inspire the community to develop next-generation multimodal foundation
models aimed at achieving general-purpose multimodal intelligence.