A Reconstruction-based Visual-Acoustic-Semantic Embedding Method for Speech-Image Retrieval

W. Cheng, W. Tang, Y. Huang, Y. Luo

IEEE Transactions on Multimedia (2022)

Abstract
Speech-image retrieval aims at learning the relevance between images and speech. Prior approaches are mainly based on bi-modal contrastive learning, which cannot well alleviate the cross-modal heterogeneity between the visual and acoustic modalities. To address this issue, we propose a visual-acoustic-semantic embedding (VASE) method. First, we propose a tri-modal ranking loss that exploits the semantic information corresponding to the acoustic data, introducing an auxiliary alignment that strengthens the alignment between image and speech. Second, we introduce a cycle-consistency loss based on feature reconstruction, which further alleviates the heterogeneity between the different modality pairs (e.g., visual-acoustic, visual-textual, and acoustic-textual). Extensive experiments demonstrate the effectiveness of the proposed method. In addition, our VASE model achieves state-of-the-art performance on the speech-image retrieval task on the Flickr8K [Harwath and Glass, 2015] and Places [Harwath et al., 2018] datasets.
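As an illustration only: the two loss families named above can be sketched in a few lines. This is a minimal NumPy sketch of (a) a bidirectional max-margin ranking loss, which tri-modal schemes of this kind typically sum over the three modality pairs, and (b) a linear cycle-consistency reconstruction penalty. The margin value, the use of linear projection matrices, and the absence of hard-negative mining are assumptions for the sketch, not the authors' exact formulation.

```python
import numpy as np

def ranking_loss(a, b, margin=0.2):
    """Bidirectional max-margin ranking loss between two batches of
    L2-normalized embeddings a, b of shape [n, d]; row i of a is
    matched with row i of b."""
    sims = a @ b.T                 # cosine similarities (rows: a, cols: b)
    pos = np.diag(sims)            # similarity of each matched pair
    # Every mismatched pair must score at least `margin` below its positive,
    # in both retrieval directions (a -> b and b -> a).
    cost_ab = np.maximum(0.0, margin + sims - pos[:, None])
    cost_ba = np.maximum(0.0, margin + sims - pos[None, :])
    np.fill_diagonal(cost_ab, 0.0)
    np.fill_diagonal(cost_ba, 0.0)
    return cost_ab.sum() + cost_ba.sum()

def tri_modal_loss(img, spc, txt, margin=0.2):
    """Sum of pairwise ranking losses over the three modality pairs
    (visual-acoustic, visual-textual, acoustic-textual)."""
    return (ranking_loss(img, spc, margin)
            + ranking_loss(img, txt, margin)
            + ranking_loss(spc, txt, margin))

def cycle_loss(a, W_ab, W_ba):
    """Reconstruction-based cycle consistency (sketch): project features
    into the other modality's space and back, penalize the residual."""
    return np.mean((a - (a @ W_ab) @ W_ba) ** 2)
```

With perfectly aligned embeddings both terms vanish; shuffling one modality's batch makes the ranking loss strictly positive, which is what drives the cross-modal alignment.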
Keywords
Speech-image retrieval, tri-modal ranking loss, cycle-consistency loss, visual-acoustic-semantic embedding