MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria
arXiv (2023)
Abstract
Multimodal large language models (MLLMs) such as GPT-4V, LLaVA, and Claude-3 have broadened the scope of AI applications. Yet evaluating their performance remains a significant challenge, owing to the inherently subjective nature of tasks that lack clear-cut solutions, especially open-ended queries. Existing automatic evaluation methodologies are largely limited to objective queries, do not reflect real-world user experience, and inadequately address the nuances of creative and associative multimodal tasks. In this paper, we propose a new evaluation paradigm for MLLMs: evaluating them with per-sample criteria, using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, whose evaluation samples span six levels of the revised Bloom's Taxonomy together with ethical considerations. We benchmark 21 popular MLLMs in a pairwise-comparison fashion and observe diverse performance across models. The validity of our benchmark is supported by an 88.02% agreement with human evaluation. We contend that the proposed paradigm unlocks the potential of MLLMs as effective evaluation tools when guided by per-sample criteria, and that MLLM-Bench will serve as a catalyst for the development of user-centric MLLMs tailored to real-world applications. Our benchmark data, online leaderboard, and submission entry are available at https://mllm-bench.llmzoo.com.
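To make the paradigm concrete, the sketch below illustrates how a per-sample-criteria, pairwise MLLM-as-judge evaluation could be wired up. It is not code from the paper: the `query_judge` wrapper, the prompt wording, and the data fields are hypothetical placeholders standing in for whatever judge model, prompts, and benchmark schema are actually used.

```python
# Illustrative sketch of pairwise MLLM-as-judge evaluation with per-sample criteria.
# Assumption: each benchmark sample carries an image, an open-ended question, and
# its own grading criteria; `query_judge` is a hypothetical call to a strong MLLM.

from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str   # evaluation image shown to both candidate models and the judge
    question: str     # open-ended query about the image
    criteria: str     # grading criteria written specifically for this query

def query_judge(prompt: str, image_path: str) -> str:
    """Hypothetical wrapper around a judge MLLM (plug in an actual API call here)."""
    raise NotImplementedError

def pairwise_verdict(sample: Sample, answer_a: str, answer_b: str) -> str:
    """Ask the judge to pick the better answer ('A', 'B', or 'tie') for one sample."""
    prompt = (
        f"Question: {sample.question}\n"
        f"Grading criteria for this question: {sample.criteria}\n\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n\n"
        "Judge strictly by the criteria above. Reply with 'A', 'B', or 'tie'."
    )
    return query_judge(prompt, sample.image_path).strip()

def win_rate(samples, answers_a, answers_b) -> float:
    """Fraction of samples where model A is preferred; ties count as half a win."""
    score = 0.0
    for sample, a, b in zip(samples, answers_a, answers_b):
        verdict = pairwise_verdict(sample, a, b)
        if verdict == "A":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
    return score / len(samples)
```

In practice one would typically also randomize which answer appears as A or B to mitigate position bias in the judge, and aggregate such pairwise win rates across model pairs to rank systems on a leaderboard.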