FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture
arXiv (2024)
Abstract
Food is a rich and varied dimension of cultural heritage, crucial to both
individuals and social groups. To bridge the gap in the literature on the
often-overlooked regional diversity in this domain, we introduce FoodieQA, a
manually curated, fine-grained image-text dataset capturing the intricate
features of food cultures across various regions in China. We evaluate
vision-language models (VLMs) and large language models (LLMs) on newly
collected, unseen food images and corresponding questions. FoodieQA comprises
three multiple-choice question-answering tasks where models need to answer
questions based on multiple images, a single image, and text-only descriptions,
respectively. While LLMs excel at text-based question answering, surpassing
human accuracy, the open-sourced VLMs still fall short by 41% on multi-image
and 21% on single-image VQA tasks, although closed-weight models perform
closer to human levels (within 10%). Our findings highlight that understanding
food and its cultural implications remains a challenging and under-explored
direction.