Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2024)
Abstract
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio
clip (A2T) and vice versa (T2A), has recently attracted much research
attention. Existing methods typically aggregate information from each modality
into a single vector for matching, but this sacrifices local details and can
hardly capture intricate relationships within and between modalities.
Furthermore, current ATR datasets lack comprehensive alignment information, and
simple binary contrastive learning labels overlook the measurement of
fine-grained semantic differences between samples. To counter these challenges,
we present a novel ATR framework that comprehensively captures the matching
relationships of multimodal information from different perspectives and finer
granularities. Specifically, a fine-grained alignment method is introduced,
achieving a more detail-oriented matching through a multiscale process from
local to global levels to capture meticulous cross-modal relationships. In
addition, we pioneer the application of cross-modal similarity consistency,
leveraging intra-modal similarity relationships as soft supervision to boost
more intricate alignment. Extensive experiments validate the effectiveness of
our approach, outperforming previous methods by significant margins of at least
3.9
(A2T) R@1 on the Clotho dataset.
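The R@1 figures quoted above are Recall@K scores with K = 1: the fraction of queries whose top-ranked candidate is a correct match. The following is a minimal sketch of how such a score is commonly computed from a similarity matrix. It is not the authors' evaluation code; the function name `recall_at_k`, the toy scores, and the one-caption-per-clip ground truth are assumptions made for illustration only.

```python
from typing import Sequence, Set


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant item in the top k.

    similarity[i][j] is the score between query i and candidate j;
    relevant[i] is the set of candidate indices that match query i.
    """
    if not similarity:
        raise ValueError("similarity must contain at least one query row")
    hits = 0
    for scores, targets in zip(similarity, relevant):
        # Rank candidate indices by descending score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if targets.intersection(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy scores (hypothetical): 3 audio clips (rows) against 3 captions (columns).
audio_to_text = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.7],
]
# Assumed ground truth: clip i matches caption i, and vice versa.
matches = [{0}, {1}, {2}]

# T2A uses the transposed matrix, so captions become the queries.
text_to_audio = [list(col) for col in zip(*audio_to_text)]

a2t_r1 = recall_at_k(audio_to_text, matches, k=1)
t2a_r1 = recall_at_k(text_to_audio, matches, k=1)
```

In this toy case two of the three clips rank their own caption first, while only one of the three captions ranks its own clip first, so the two directions give different scores. This is why A2T and T2A results are reported separately.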
Keywords
audio-text retrieval, multiscale matching, cross-modal similarity