EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
arXiv (2024)
Abstract
Recent advancements in self-supervised audio-visual representation learning
have demonstrated its potential to capture rich and comprehensive
representations. However, despite the advantages of data augmentation verified
in many learning methods, audio-visual learning has struggled to fully harness
these benefits, as augmentations can easily disrupt the correspondence between
input pairs. To address this limitation, we introduce EquiAV, a novel framework
that leverages equivariance for audio-visual contrastive learning. Our approach
begins with extending equivariance to audio-visual learning, facilitated by a
shared attention-based transformation predictor. It enables the aggregation of
features from diverse augmentations into a representative embedding, providing
robust supervision. Notably, this is achieved with minimal computational
overhead. Extensive ablation studies and qualitative results verify the
effectiveness of our method. EquiAV outperforms previous works across various
audio-visual benchmarks.