Answer-Based Entity Extraction and Alignment for Visual Text Question Answering
MM '23: Proceedings of the 31st ACM International Conference on Multimedia (2023)
Abstract
As a variant of visual question answering (VQA), visual text question answering (VTQA) provides a text-image pair for each question. The text uses named entities to describe the corresponding image, so the ability to perform multi-hop reasoning over named entities between text and image becomes critically important. However, existing models pay relatively little attention to this aspect. We therefore propose the Answer-Based Entity Extraction and Alignment model (AEEA) to enable comprehensive understanding and support multi-hop reasoning. The core of AEEA lies in two main components: AKECMR and an answer-aware predictor. The former emphasizes cross-modal alignment and effectively distinguishes intra-modal from inter-modal information; the latter fully exploits the intrinsic semantic information contained in answers during training. Our model outperforms the baseline by 2.24% on the test-dev set and 1.06% on the test set, securing third place in VTQA2023 (English).
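To make the entity-alignment idea concrete, here is a minimal conceptual sketch, not the AEEA implementation. It assumes a toy setup in which image regions carry detector labels and text entities come from a fixed known-entity list (both hypothetical stand-ins for a real NER model and visual detector): text entities are linked to image regions by label match, giving the cross-modal hops that multi-hop reasoning would traverse.

```python
# Conceptual illustration of entity-based text-image alignment.
# All data below is hypothetical; a real system would use an NER model
# and an object detector instead of string matching.

def extract_entities(text, known_entities):
    """Return named entities from `text` found in a known-entity list
    (a stand-in for a real named-entity recognizer)."""
    return [e for e in known_entities if e in text]

def align_entities(text_entities, region_labels):
    """Map each text entity to indices of image regions whose label
    matches it -- the cross-modal links used for multi-hop reasoning."""
    return {e: [i for i, lab in enumerate(region_labels) if lab == e]
            for e in text_entities}

known = ["Eiffel Tower", "Paris"]
text = "The Eiffel Tower dominates the Paris skyline."
regions = ["sky", "Eiffel Tower", "crowd", "Paris"]  # hypothetical detector output

ents = extract_entities(text, known)
links = align_entities(ents, regions)
print(links)  # {'Eiffel Tower': [1], 'Paris': [3]}
```

A question such as "What city is this landmark in?" would then be answered by hopping from the text entity to its aligned region and back to other entities in the text.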