TASK 1B DCASE 2021: AUDIO-VISUAL SCENE CLASSIFICATION WITH SQUEEZE-EXCITATION CONVOLUTIONAL RECURRENT NEURAL NETWORKS Technical Report

semanticscholar(2021)

Abstract
Automatic scene classification has been one of the core tasks in every edition of the DCASE challenge. Until this edition, such classification was performed using only audio data, and so the problem was defined as Acoustic Scene Classification (ASC). In the 2021 edition, audio data is accompanied by visual data, providing additional information that can be jointly exploited to achieve higher recognition accuracy. The proposed approach uses two separate networks, trained in isolation on audio and visual data respectively, so that each network specializes in one modality. After each network is trained, information from the audio and visual subnetworks is fused at two different stages. The early fusion stage combines the features produced by the last convolutional block of each subnetwork at different time steps and feeds them to a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions of the two subnetworks to produce the final prediction. The visual subnetwork is a VGG16 architecture pretrained on the Places365 dataset and fine-tuned on the Challenge dataset. The audio subnetwork, in contrast, is trained from scratch and uses squeeze-excitation techniques, as in previous contributions from this team. The final accuracy of the system is 92% on the development split, outperforming the baseline by 15 percentage points.
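
The two fusion stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give layer sizes, the type of recurrent unit, or the rule used to combine predictions. The sketch therefore assumes a plain tanh recurrent layer run in both time directions, random untrained weights, equal numbers of time steps for both modalities, and a simple average of class probabilities for late fusion. All names and dimensions (`early_fusion`, `late_fusion`, `T`, `Da`, `Dv`, `H`, `C`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_pass(x, w_in, w_rec):
    """Run a simple tanh RNN over x of shape (T, D); return the last hidden state."""
    h = np.zeros(w_rec.shape[0])
    for x_t in x:
        h = np.tanh(w_in @ x_t + w_rec @ h)
    return h

def early_fusion(audio_feats, visual_feats, params):
    """Concatenate per-time-step features and classify with a bidirectional RNN.

    audio_feats: (T, Da), visual_feats: (T, Dv). Assumes both share T.
    """
    fused = np.concatenate([audio_feats, visual_feats], axis=1)  # (T, Da + Dv)
    h_fwd = rnn_pass(fused, params["w_in_f"], params["w_rec_f"])        # forward in time
    h_bwd = rnn_pass(fused[::-1], params["w_in_b"], params["w_rec_b"])  # backward in time
    logits = params["w_out"] @ np.concatenate([h_fwd, h_bwd])
    return softmax(logits)

def late_fusion(p_early, p_audio, p_visual):
    """Assumed combination rule: unweighted mean of the three probability vectors."""
    return (p_early + p_audio + p_visual) / 3.0

# Hypothetical sizes: time steps, audio/visual feature dims, hidden units, classes.
T, Da, Dv, H, C = 4, 8, 6, 5, 10
params = {
    "w_in_f": rng.normal(scale=0.1, size=(H, Da + Dv)),
    "w_rec_f": rng.normal(scale=0.1, size=(H, H)),
    "w_in_b": rng.normal(scale=0.1, size=(H, Da + Dv)),
    "w_rec_b": rng.normal(scale=0.1, size=(H, H)),
    "w_out": rng.normal(scale=0.1, size=(C, 2 * H)),
}

# Stand-ins for the outputs of the two separately trained subnetworks.
audio_feats = rng.normal(size=(T, Da))
visual_feats = rng.normal(size=(T, Dv))
p_audio = softmax(rng.normal(size=C))
p_visual = softmax(rng.normal(size=C))

p_early = early_fusion(audio_feats, visual_feats, params)
p_final = late_fusion(p_early, p_audio, p_visual)
predicted_class = int(p_final.argmax())
```

In this sketch the late fusion output is still a valid probability distribution, because it is a mean of three distributions. A trained system could replace the average with learned weights or an additional classification layer; the abstract does not say which was used.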