Image-Text Retrieval With Cross-Modal Semantic Importance Consistency

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
Cross-modal image-text retrieval is an important vision-and-language task that models the similarity of image-text pairs by embedding their features into a shared space for alignment. To bridge the heterogeneous gap between the two modalities, current approaches achieve inter-modal alignment and intra-modal semantic relationship modeling through complex weighted combinations of items. In both the intra-modal association and the inter-modal interaction processes, items with higher weights contribute more to the global semantics. However, the same item always contributes differently to the two processes, since most traditional approaches focus only on the alignment. This usually results in semantic changes and misalignment. To address this issue, this paper proposes Cross-modal Semantic Importance Consistency (CSIC), which keeps the semantics of items invariant during alignment. The proposed technique measures the semantic importance of items obtained from intra-modal and inter-modal self-attention, and learns a more reasonable representation vector by inter-calibrating the two importance distributions to improve performance. We conducted extensive experiments on the Flickr30K and MS COCO datasets. The results show that our approach can significantly improve retrieval performance, demonstrating the superiority and rationality of the proposed approach.
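As a rough illustration of what consistency between two importance distributions could look like, the sketch below normalizes two sets of per-item attention scores and measures how far apart they are. The abstract does not specify the loss used by CSIC, so the softmax normalization, the symmetric KL divergence, and every function and variable name here are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def importance_consistency(intra_scores, inter_scores):
    """Symmetric KL between the two importance distributions.

    Assumed stand-in for a consistency penalty: it is zero when both
    processes rank the items identically and grows as they disagree.
    """
    p = softmax(intra_scores)
    q = softmax(inter_scores)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical attention scores for four items (e.g. image regions).
intra = [2.0, 0.5, 0.1, 1.0]  # importance from intra-modal attention
inter = [0.3, 2.2, 0.1, 1.0]  # importance from inter-modal attention

print(importance_consistency(intra, intra))  # identical distributions
print(importance_consistency(intra, inter))  # disagreeing distributions
```

Under these assumptions, minimizing such a penalty alongside a retrieval loss would push an item's importance in the intra-modal step toward its importance in the inter-modal step, which is the kind of calibration the abstract describes.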
Keywords
Cross-modal, image-text retrieval, self-attention, semantic importance, alignment