Fine-Grained Features Alignment and Fusion for Text-Video Cross-Modal Retrieval

Shuili Zhang, Hongzhang Mu, Quangang Li, Chenglong Xiao, Tingwen Liu

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2024)

Abstract
Text-video cross-modal retrieval is an increasingly prominent and challenging task that has garnered significant attention. Traditional models typically embed videos and texts into global vectors, aiming to capture the global features of these modalities. However, these models often fall short in capturing fine-grained semantic details, and relying solely on global features proves insufficient to address this challenge. Hence, there is a pressing need to bridge the gap between different modalities by incorporating fine-grained features. In light of this, we propose a highly efficient model designed to capture the fine-grained features of videos and texts, including question-answer semantic alignment, object alignment, and text-video feature fusion. For texts, our model incorporates entity information and part-of-speech information (adjectives, nouns, and verbs), while for videos, the identification of objects plays a crucial role in facilitating text-video retrieval. Our model is trained on the WebVid and CC3M datasets and outperforms baseline models. It performs particularly well on zero-shot text-video cross-modal retrieval tasks while substantially reducing the computational resources required.
Keywords
Fine-grained feature alignment, text-video cross-modal retrieval, zero-shot retrieval