Unsupervised Audio-Visual Segmentation with Modality Alignment
arXiv (2024)
Abstract
Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the
object in a visual scene that produces a given sound. Current AVS methods rely
on costly fine-grained annotations of mask-audio pairs, making them impractical
to scale. To address this, we introduce unsupervised AVS, eliminating
the need for such expensive annotation. To tackle this more challenging
problem, we propose an unsupervised learning method, named Modality
Correspondence Alignment (MoCA), which seamlessly integrates off-the-shelf
foundation models like DINO, SAM, and ImageBind. This approach leverages their
knowledge complementarity and optimizes their joint usage for multi-modality
association. Initially, we estimate positive and negative image pairs in the
feature space. For pixel-level association, we introduce an audio-visual
adapter and a novel pixel matching aggregation strategy within the image-level
contrastive learning framework. This enables a flexible connection between
object appearance and the audio signal at the pixel level, with tolerance to
imaging variations such as translation and rotation. Extensive experiments on
the AVSBench (single and multi-object splits) and AVSS datasets demonstrate
that MoCA outperforms strong, purpose-built baseline methods and approaches its
supervised counterparts, particularly in complex scenarios with multiple
auditory objects. Notably, in terms of mIoU, MoCA achieves a substantial
improvement over the baselines on both AVSBench (S4: +17.24%) and
AVSS (+19.23%).