ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation
arxiv(2024)
摘要
Web-scale training on paired text-image data is becoming increasingly central
to multimodal learning, but is challenged by the highly noisy nature of
datasets in the wild. Standard data filtering approaches succeed in removing
mismatched text-image pairs, but permit semantically related but highly
abstract or subjective text. These approaches lack the fine-grained ability to
isolate the most concrete samples that provide the strongest signal for
learning in a noisy dataset. In this work, we propose a new metric, image
caption concreteness, that evaluates caption text without an image reference to
measure its concreteness and relevancy for use in multimodal learning. Our
approach leverages strong foundation models for measuring visual-semantic
information loss in multimodal representations. We demonstrate that this
strongly correlates with human evaluation of concreteness in both single-word
and sentence-level texts. Moreover, we show that curation using ICC complements
existing approaches: It succeeds in selecting the highest quality samples from
multimodal web-scale datasets to allow for efficient training in
resource-constrained settings.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要