Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization
NeurIPS 2023 (2024)
Abstract
Audio-Visual Source Localization (AVSL) aims to locate sounding objects
within video frames given the paired audio clips. Existing methods
predominantly rely on self-supervised contrastive learning of audio-visual
correspondence. Without any bounding-box annotations, they struggle to achieve
precise localization, especially for small objects, and suffer from blurry
boundaries and false positives. Moreover, naive semi-supervised methods fail
to fully leverage the information in abundant unlabeled data. In this
paper, we propose a novel semi-supervised learning framework for AVSL, namely
Dual Mean-Teacher (DMT), comprising two teacher-student structures to
circumvent the confirmation bias issue. Specifically, two teachers, pre-trained
on limited labeled data, are employed to filter out noisy samples via the
consensus between their predictions, and then generate high-quality
pseudo-labels by intersecting their confidence maps. The thorough use
of both labeled and unlabeled data and the proposed unbiased framework enable
DMT to outperform current state-of-the-art methods by a large margin, with CIoU
of 90.4 […] 9.6 […]
respectively, given only 3 […]
[…] framework to some existing AVSL methods and consistently boost their
performance.
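
The filtering and pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mask IoU as the consensus measure, and both threshold values (0.5) are assumptions chosen for illustration. Each teacher is assumed to output a confidence map with values in [0, 1]; a sample is kept only when the two binarized maps agree closely enough, and the pseudo-label is the intersection of the two masks.

```python
import numpy as np

def binarize(conf_map, thresh=0.5):
    """Turn a confidence map into a boolean mask (threshold is an assumption)."""
    return conf_map >= thresh

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks; 0.0 if both are empty."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def consensus_pseudo_label(conf_a, conf_b, map_thresh=0.5, agree_thresh=0.5):
    """Hypothetical helper: return the intersected pseudo-label mask,
    or None when the two teachers disagree and the sample is rejected."""
    mask_a = binarize(conf_a, map_thresh)
    mask_b = binarize(conf_b, map_thresh)
    # Consensus check: discard samples on which the teachers disagree.
    if mask_iou(mask_a, mask_b) < agree_thresh:
        return None
    # Pseudo-label: only pixels that both teachers mark as sounding.
    return np.logical_and(mask_a, mask_b)

# Toy example: two teachers that mostly agree on a 4x4 frame.
teacher_a = np.zeros((4, 4)); teacher_a[1:3, 1:3] = 0.9
teacher_b = np.zeros((4, 4)); teacher_b[1:3, 1:4] = 0.8
label = consensus_pseudo_label(teacher_a, teacher_b)

# A teacher that points somewhere else entirely leads to rejection.
teacher_c = np.zeros((4, 4)); teacher_c[0, 0] = 0.9
rejected = consensus_pseudo_label(teacher_a, teacher_c)
```

Taking the intersection rather than the union keeps only the region both teachers support, which trades some recall for fewer false-positive pixels in the pseudo-label.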