Multimodal System for Audio Scene Source Counting and Analysis

IEEE/ACM Transactions on Audio, Speech, and Language Processing (2022)

Abstract
Audio scene analysis (ASA) is a challenging and multifaceted task in audio signal processing that uncovers information about the nature of an audio recording. Regardless of the analysis goal, every audio scene contains some number of audio sources. However, this number is rarely explored or given much consideration in research. This work aims to demonstrate the utility of audio source counting with a novel multimodal system for ASA. Both the speaker counting and the sound event counting techniques use deep neural networks (DNNs) to predict the number of sources. For speaker counting, we achieve a prediction accuracy of 46.03%, and of 89.57% with a margin of error of $\pm 1$, which outperforms state-of-the-art systems for similar tasks. For sound event counting, we achieve a prediction accuracy of 50.55%, and of 86.59% with a margin of error of $\pm 1$, which establishes a clear baseline. Our system also demonstrates real-time characteristics, with an overall processing time of $\sim 0.4614$ s per audio recording.
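The two accuracy figures reported for each task can be read as exact-match accuracy and accuracy within a tolerance of one source. The abstract does not spell out the evaluation procedure, so the following is only a minimal sketch of that reading. The function name `count_accuracy`, its `tolerance` parameter, and the sample counts are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def count_accuracy(
    true_counts: Sequence[int],
    predicted_counts: Sequence[int],
    tolerance: int = 0,
) -> float:
    """Fraction of recordings whose predicted source count is within
    `tolerance` of the true count (tolerance=0 means exact match).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(true_counts) != len(predicted_counts):
        raise ValueError("true_counts and predicted_counts must have equal length")
    if len(true_counts) == 0:
        raise ValueError("at least one recording is required")
    hits = sum(
        1
        for true, predicted in zip(true_counts, predicted_counts)
        if abs(true - predicted) <= tolerance
    )
    return hits / len(true_counts)


# Made-up counts for five recordings (not data from the paper).
true_counts = [1, 2, 3, 4, 5]
predicted_counts = [1, 3, 3, 6, 4]

exact = count_accuracy(true_counts, predicted_counts)                   # 2 of 5 match exactly
within_one = count_accuracy(true_counts, predicted_counts, tolerance=1)  # 4 of 5 are within +/-1
print(f"exact: {exact:.2%}, within +/-1: {within_one:.2%}")
```

Under this reading, the tolerance-based figure can never be lower than the exact-match figure, which is consistent with the pairs of numbers reported above.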
Keywords
Audio scene analysis, source counting, speaker count estimation