Answer-Based Entity Extraction and Alignment for Visual Text Question Answering

MM '23: Proceedings of the 31st ACM International Conference on Multimedia (2023)

Abstract
As a variant of visual question answering (VQA), visual text question answering (VTQA) provides a text-image pair for each question. The text uses named entities to describe the corresponding image, so the ability to perform multi-hop reasoning over named entities shared between text and image is critically important. However, existing models pay relatively little attention to this aspect. We therefore propose the Answer-Based Entity Extraction and Alignment Model (AEEA) to enable comprehensive understanding and support multi-hop reasoning. The core of AEEA lies in two components: AKECMR and an answer-aware predictor. The former emphasizes alignment between modalities and effectively distinguishes intra-modal from inter-modal information; the latter prioritizes full use of the intrinsic semantic information contained in answers during training. Our model outperforms the baseline by 2.24% on the test-dev set and 1.06% on the test set, securing third place in VTQA2023 (English).
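The abstract's notion of aligning named entities in the text with regions of the image can be illustrated with a minimal sketch. The code below is purely a hypothetical stand-in, not the paper's AKECMR module (whose details the abstract does not give): it assumes each entity and each image region is already encoded as a vector, and aligns every entity to its most similar region by cosine similarity.

```python
import numpy as np

def align_entities(entity_emb, region_emb):
    """Match each text entity to its best image region by cosine similarity.

    entity_emb: (num_entities, d) array of entity embeddings
    region_emb: (num_regions, d) array of image-region embeddings
    Returns the best region index per entity and the full similarity matrix.
    (Illustrative only; the paper's actual alignment method may differ.)
    """
    # Normalize rows to unit length so dot products equal cosine similarity.
    e = entity_emb / np.linalg.norm(entity_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    sim = e @ r.T  # shape: (num_entities, num_regions)
    return sim.argmax(axis=1), sim

# Toy example: 2 entities, 3 regions, 4-dimensional embeddings.
entities = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]])
regions = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0],
                    [0.1, 0.9, 0.0, 0.0]])
idx, sim = align_entities(entities, regions)
print(idx)  # entity 0 aligns to region 0, entity 1 to region 2
```

Such entity-to-region links are what would let a model chain a hop from a named entity in the text to visual evidence in the image and back.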