MAFA-Net: Multimodal Attribute Feature Attention Network for visual question answering

Research Square (Research Square)(2023)

Abstract
Visual Question Answering (VQA) is a popular research task that aims to answer natural-language questions about the content of an image. Most VQA models ignore visual appearance and attribute features, so complex questions often go unanswered correctly. To address this problem, we propose a new end-to-end VQA model called the Multimodal Attribute Feature Attention Network (MAFA-Net). First, a self-guided word attention module is designed to connect entity words with semantic words. Second, two question-adaptive visual attention modules are presented that not only extract important regional features but also focus on key attribute features (e.g., color and spatial relationships). Additionally, a combining strategy is proposed to better explore the spatial relationships between objects and their appearance attributes. Finally, experimental results show that MAFA-Net achieves performance competitive with state-of-the-art models on two large-scale VQA datasets.
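The question-adaptive visual attention described above can be illustrated with a minimal sketch: score each image region by its affinity with the question embedding, softmax-normalize the scores, and pool the region features with the resulting weights. This is a generic question-guided attention mechanism, not the paper's exact formulation; all names, shapes, and the bilinear scoring matrix `w` are assumptions for illustration.

```python
import numpy as np

def question_guided_attention(region_feats, question_vec, w):
    """Hypothetical sketch of question-guided visual attention.

    region_feats: (num_regions, dim) visual features, one row per region.
    question_vec: (dim,) question embedding.
    w:            (dim, dim) bilinear scoring matrix (an assumption).
    Returns the attention-weighted visual feature of shape (dim,).
    """
    # One scalar affinity score per region via a bilinear form.
    scores = region_feats @ w @ question_vec          # (num_regions,)
    # Softmax over regions (subtract max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of region features.
    return weights @ region_feats                     # (dim,)

# Toy usage: 3 regions with 4-dim features and a 4-dim question embedding.
rng = np.random.default_rng(0)
regions = rng.standard_normal((3, 4))
question = rng.standard_normal(4)
w = np.eye(4)
attended = question_guided_attention(regions, question, w)
```

Because the softmax weights are non-negative and sum to one, the attended feature is a convex combination of the region features, so each of its coordinates stays within the range spanned by the regions.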
Keywords
attention, visual question, mafa-net