MAFA-Net: Multimodal Attribute Feature Attention Network for visual question answering

Research Square (Research Square)(2023)

Abstract
Visual Question Answering (VQA) is a popular research task that aims to answer natural-language questions about the content of an image. Most VQA models ignore visual appearance and attribute features, so complex questions often go unanswered correctly. To address this problem, we propose a new end-to-end VQA model called the Multimodal Attribute Feature Attention Network (MAFA-Net). First, a self-guided word attention module is designed to connect entity words with semantic words. Second, two question-adaptive visual attention modules are presented that not only extract important regional features but also focus on key attribute features (e.g., color and spatial relationships). Additionally, a combining strategy is proposed to better explore the spatial relationships between objects and their appearance attributes. Finally, experimental results show that MAFA-Net achieves performance competitive with state-of-the-art models on two large-scale VQA datasets.
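The question-adaptive visual attention described above can be illustrated with a minimal sketch: score each image region by its affinity with the question embedding, softmax-normalize the scores, and pool the region features with the resulting weights. This is a generic question-guided attention mechanism, not the paper's exact formulation; all names, shapes, and the bilinear scoring matrix `w` are assumptions for illustration.

```python
import numpy as np

def question_guided_attention(region_feats, question_vec, w):
    """Hypothetical sketch of question-guided visual attention.

    region_feats: (num_regions, dim) visual features, one row per region.
    question_vec: (dim,) question embedding.
    w:            (dim, dim) bilinear scoring matrix (an assumption).
    Returns the attention-weighted visual feature of shape (dim,).
    """
    # One scalar affinity score per region via a bilinear form.
    scores = region_feats @ w @ question_vec          # (num_regions,)
    # Softmax over regions (subtract max for numerical stability).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of region features.
    return weights @ region_feats                     # (dim,)

# Toy usage: 3 regions with 4-dim features and a 4-dim question embedding.
rng = np.random.default_rng(0)
regions = rng.standard_normal((3, 4))
question = rng.standard_normal(4)
w = np.eye(4)
attended = question_guided_attention(regions, question, w)
```

Because the softmax weights are non-negative and sum to one, the attended feature is a convex combination of the region features, so each of its coordinates stays within the range spanned by the regions.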
Keywords
attention, visual question, mafa-net