
MAPM: multiscale attention pre-training model for TextVQA

Applied Intelligence (2024)

Abstract
The Text Visual Question Answering (TextVQA) task aims to enable models to read and answer questions about images that contain text. Existing attention-based methods for TextVQA often struggle to align local features across modalities during multimodal interaction, and this misalignment limits their ability to answer such questions accurately. To address this issue, the Multiscale Attention Pre-training Model (MAPM) is proposed to enhance multimodal feature fusion. MAPM introduces multiscale attention modules, which facilitate fine-grained local feature enhancement and global feature fusion across modalities. By adopting these modules, MAPM achieves superior performance in aligning and integrating visual and textual information. Additionally, MAPM benefits from pre-training on scene text with three pre-training tasks: masked language modeling, visual region matching, and OCR visual text matching. This pre-training establishes effective semantic alignment among the different modalities. Experimental evaluations demonstrate the superiority of MAPM, achieving a 1.2…
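The paper's implementation is not shown on this page, so the sketch below only illustrates the general idea of cross-modal attention computed at multiple scales, assuming PyTorch. All names here (MultiscaleCrossAttention, the scales parameter, the pooling-based coarsening) are hypothetical and not taken from MAPM itself.

```python
import torch
import torch.nn as nn

class MultiscaleCrossAttention(nn.Module):
    """One way to attend across modalities at several scales.

    Scale 1 keeps fine-grained token-level alignment; coarser
    scales pool the context into a more global key/value set.
    """

    def __init__(self, dim: int, num_heads: int = 8, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in scales
        )
        # Project the concatenated per-scale outputs back to `dim`.
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   (batch, Lq, dim), e.g. question-token features
        # context: (batch, Lc, dim), e.g. visual or OCR-token features
        outputs = []
        for scale, attn in zip(self.scales, self.attn):
            if scale > 1:
                # Average-pool the context sequence to obtain a
                # coarser, more global view of the other modality.
                ctx = nn.functional.avg_pool1d(
                    context.transpose(1, 2), kernel_size=scale, stride=scale
                ).transpose(1, 2)
            else:
                ctx = context
            out, _ = attn(query, ctx, ctx)
            outputs.append(out)
        return self.fuse(torch.cat(outputs, dim=-1))
```

In this reading, the scale-1 branch handles the fine-grained local alignment the abstract describes, while the pooled branches supply global context; concatenating and projecting the branches is one plausible fusion choice among several.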
Keywords
Attention mechanisms,Cross-modal semantic alignment,Pre-training,Text visual question answering