CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
CoRR (2024)
Abstract
Multimodal large language models (MLLMs) have demonstrated promising results
in a variety of tasks that combine vision and language. As these models become
more integral to research and applications, conducting comprehensive
evaluations of their capabilities has grown increasingly important. However,
most existing benchmarks fail to consider that, in certain situations, images
need to be interpreted within a broader context. In this work, we introduce a
new benchmark, named CODIS, designed to assess the ability of models to use
context provided in free-form text to enhance visual comprehension. Our
findings indicate that MLLMs consistently fall short of human performance on
this benchmark. Further analysis confirms that these models struggle to
effectively extract and utilize contextual information to improve their
understanding of images. This underscores the pressing need to enhance the
ability of MLLMs to comprehend visuals in a context-dependent manner. View our
project website at https://thunlp-mt.github.io/CODIS.