Scene-text Oriented Visual Entailment: Task, Dataset and Solution

MM '23: Proceedings of the 31st ACM International Conference on Multimedia (2023)

Abstract
Visual Entailment (VE) is a fine-grained reasoning task that aims to predict whether an image semantically entails a hypothesis given in textual form. Existing studies of VE focus only on basic visual attributes and largely overlook scene text, which usually carries rich semantic information and crucial clues (e.g., time, place, affiliation, and topic); this leads to superficial hypothesis design or incorrect entailment predictions. To fill this gap, we propose a new task called scene-text oriented Visual Entailment (STOVE), which requires models to predict whether an image semantically entails a hypothesis designed around scene text-centered visual information. STOVE challenges a model to deeply understand the interplay between language and images containing scene text, requiring it to align hypothesis tokens, scene text, and visual content. To support research on STOVE, we further collect a dataset termed TextVE, consisting of 23,864 images and 47,728 scene text-related hypotheses, constructed with a strategy that minimizes biases. Additionally, we present a baseline named MMTVE, which applies a multimodal transformer to model the spatial, semantic, and visual reasoning relations among scene text tokens, hypotheses, and visual features. Experimental results show that our model comprehends STOVE effectively and achieves outstanding performance. Our code is available at https://github.com/VISLANG-Lab/TextVE.
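The abstract does not specify MMTVE's internals, but the general recipe it names, fusing hypothesis tokens, scene-text (OCR) tokens, and visual features in one transformer sequence and classifying entailment, can be sketched minimally. Everything below (dimensions, token counts, random weights, the single attention layer, mean pooling) is a hypothetical illustration, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention over the joint sequence,
    # letting every hypothesis / OCR / visual token attend to every other.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return A @ V

d = 16                            # hypothetical shared embedding size
hyp = rng.normal(size=(5, d))     # hypothesis token embeddings
ocr = rng.normal(size=(4, d))     # scene-text (OCR) token embeddings
vis = rng.normal(size=(6, d))     # visual region features

# Concatenate the three modalities into one sequence for joint reasoning.
X = np.concatenate([hyp, ocr, vis], axis=0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)

# Pool the fused sequence and classify: entailment vs. non-entailment.
W_cls = rng.normal(size=(d, 2)) * 0.1
probs = softmax(H.mean(axis=0) @ W_cls)
```

A real implementation would stack several such layers with learned projections per modality and train the classifier head on TextVE's binary labels; the sketch only shows the fusion-then-classify shape of the approach.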