Dual-adaptive interactive transformer with textual and visual context for image captioning

Lizhi Chen, Kesen Li

Expert Systems with Applications (2024)

Abstract
The multimodal Transformer, which integrates visual and textual contextual information, has recently shown success in image captioning tasks. However, natural complementarity and redundancy remain between text and vision, and effectively integrating the information from both modalities is crucial for comprehending the content of an image. In this paper, we propose the Dual-Adaptive Interactive Transformer (DAIT), which incorporates similar textual and visual contextual information into both the encoding and decoding stages. Specifically, during encoding, we propose the Adaptive Interactive Encoder (AIE), which expands the feature vectors of both modalities by introducing new operations. We also introduce normalization gate factors to mitigate the noise caused by the interaction between the two modalities. During decoding, we propose the Adaptive Interactive Decoder (AID), which adaptively adjusts the multimodal features at each time step through similarity-weighted textual and visual branches. To validate our model, we conducted extensive experiments on the MS COCO image captioning dataset and achieved outstanding performance compared with many state-of-the-art methods.
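The abstract does not give the exact formulations of the normalization gate factors or the similarity weighting, so the following is only a minimal PyTorch sketch of one plausible reading: a cross-attention message from the visual branch into the textual branch, suppressed by a sigmoid gate (one possible form of a "normalization gate factor"), with the fused output weighted by the similarity of pooled textual and visual features. The module name, the sigmoid gate, and the cosine-similarity weighting are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossModalFusion(nn.Module):
    """Hypothetical sketch of gated text-vision interaction in the spirit of
    DAIT's AIE/AID: cross-modal message + gate + similarity-weighted fusion.
    All specific forms are assumptions made for illustration."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Text queries attend over visual features (cross-modal interaction).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Produces per-channel gate factors from the concatenated features.
        self.gate = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # text: (B, L_t, D) token features; vis: (B, L_v, D) region features.
        msg, _ = self.cross_attn(text, vis, vis)          # (B, L_t, D)
        # Gate the cross-modal message to mitigate interaction noise
        # (assumed sigmoid form of the "normalization gate factor").
        g = torch.sigmoid(self.gate(torch.cat([text, msg], dim=-1)))
        fused = self.norm(text + g * msg)
        # Weight the fused branch by the similarity of pooled representations
        # (one possible reading of "similarity-weighted branches").
        w = torch.sigmoid(F.cosine_similarity(text.mean(dim=1),
                                              vis.mean(dim=1), dim=-1))
        w = w.view(-1, 1, 1)                              # (B, 1, 1)
        return w * fused + (1.0 - w) * text

# Usage with dummy shapes: 20 caption tokens, 36 image regions.
fusion = GatedCrossModalFusion(dim=512)
out = fusion(torch.randn(2, 20, 512), torch.randn(2, 36, 512))  # (2, 20, 512)
```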
Keywords
Image captioning, Transformer, Textual and visual, Encoder-decoder, Adaptive interactive