Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing
CoRR (2024)
Abstract
Large pretrained vision-language models have drawn considerable interest in
recent years due to their remarkable performance. Despite considerable efforts
to assess these models from diverse perspectives, the extent of visual cultural
awareness in the state-of-the-art GPT-4V model remains unexplored. To address
this gap, we extensively probed GPT-4V using the MaRVL benchmark dataset,
aiming to investigate its capabilities and limitations in visual understanding
with a focus on cultural aspects. Specifically, we introduced three
vision-related tasks, namely caption classification, pairwise captioning, and
culture tag selection, to enable a systematic, fine-grained evaluation of
visual cultural awareness. Experimental results indicate that GPT-4V excels at identifying
cultural concepts but still exhibits weaker performance in low-resource
languages, such as Tamil and Swahili. Notably, human evaluation shows that
GPT-4V's image captions are more culturally relevant than the original MaRVL
human annotations, suggesting a promising approach for constructing future
visual cultural benchmarks.