Exploring Audio-Visual Information Fusion for Sound Event Localization and Detection In Low-Resource Realistic Scenarios
arXiv (2024)
Abstract
This study presents an audio-visual information fusion approach to sound
event localization and detection (SELD) in low-resource scenarios. We aim to
exploit both the audio and video modalities through cross-modal learning and
multi-modal fusion. First, we propose a cross-modal teacher-student learning
(TSL) framework to transfer information from an audio-only teacher model,
trained on a rich collection of audio data with multiple data augmentation
techniques, to an audio-visual student model trained with only a limited set of
multi-modal data. Next, we propose a two-stage audio-visual fusion strategy,
consisting of early feature fusion followed by late video-guided decision
fusion, to exploit synergies between the audio and video modalities. Finally,
we introduce a video pixel swapping (VPS) technique that extends the audio
channel swapping (ACS) method into a joint audio-visual augmentation. Evaluation results
on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2023
Challenge data set demonstrate significant improvements in SELD performance.
Furthermore, our submission to the SELD task of the DCASE 2023 Challenge ranked
first by effectively integrating the proposed techniques into a model
ensemble.
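
To make the idea of a joint audio-visual augmentation concrete, the following is a minimal sketch of a single left-right mirroring transform applied consistently to audio, video, and labels. It is not the authors' implementation. It assumes first-order Ambisonics audio in W, Y, Z, X channel order, video frames in which horizontal pixel position corresponds to azimuth with 0 degrees at the image centre, and labels given as azimuth and elevation in degrees. The function name `mirror_left_right` and the type aliases are hypothetical.

```python
from typing import List, Tuple

Audio = List[List[float]]         # [channel][sample]; assumed FOA order W, Y, Z, X
Frame = List[List[int]]           # [row][column] of pixel values
Event = Tuple[str, float, float]  # (class label, azimuth in degrees, elevation in degrees)


def mirror_left_right(
    audio: Audio, frames: List[Frame], events: List[Event]
) -> Tuple[Audio, List[Frame], List[Event]]:
    """Apply one joint transform (azimuth -> -azimuth) to audio, video and labels.

    Hypothetical helper for illustration only; the channel order and the
    pixel-to-azimuth mapping are assumptions stated in the lead-in.
    """
    w, y, z, x = audio
    # Y encodes sin(azimuth) * cos(elevation), so negating azimuth flips its sign.
    audio_out = [list(w), [-s for s in y], list(z), list(x)]
    # The matching video transform reverses the pixel order within each row.
    frames_out = [[row[::-1] for row in frame] for frame in frames]
    # Labels follow the same mapping so that all three stay consistent.
    events_out = [(label, -az, el) for label, az, el in events]
    return audio_out, frames_out, events_out


if __name__ == "__main__":
    audio = [[1.0, 2.0], [0.5, -0.5], [0.1, 0.2], [0.3, 0.4]]
    frames = [[[1, 2, 3], [4, 5, 6]]]
    events = [("speech", 30.0, 10.0)]
    print(mirror_left_right(audio, frames, events))
```

The point of the sketch is that the same spatial mapping has to be applied to every modality and to the labels; transforming only the audio channels would leave the video frames and annotations pointing at the wrong direction.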