Investigating the Role of Human Action Detector in Visual-guide Audio Source Separation System

Thanh Thi-Hien Duong, Trung-Hieu Nguyen, The Thanh-Dat Le, Thi-Lich Nghiem, Duc-Huy Pham, Thi-Lan Le

2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2023)

Abstract
Visual-guided Audio Source Separation (VASS) is the task of using available visual information to guide the separation of audio from a mixture signal containing the sounds of many simultaneous sources. The visual information can be images of the sound sources (e.g., musical instruments) or of human gestures and activities (e.g., musicians). Visual features, audio features, and the correlation between the two can be used to estimate the audio mask in a source separation model and thereby improve separation performance. In this study, we introduce a new multi-modal audio separation framework that jointly trains its processing blocks, helping the network find the features of the target sounds more effectively. The proposed framework consists of three main blocks: a visual extractor, an action extractor, and an audio separator, which combine visual, action, and spectral features to estimate the spectral masks of the separated sounds. Extensive experiments have been conducted on the MUSIC dataset of musical performance videos to evaluate the quality of human joint estimation and gesture representation and their role in audio separation. Experimental results obtained under different settings confirm the effectiveness of combining an action extractor with the audio-visual separation model.
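The mask-based conditioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the projection weights `W_a`, `W_v`, `W_m` and the `separate` function are hypothetical stand-ins for the learned visual extractor, action extractor, and audio separator, and the fusion is reduced to a dot-product correlation followed by a sigmoid mask applied to the mixture's magnitude spectrogram.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def separate(mixture_spec, visual_feat, action_feat, W_a, W_v, W_m):
    """Estimate one target source via a conditioned spectral mask (toy sketch).

    mixture_spec : (F, T) magnitude spectrogram of the mixture
    visual_feat  : (Dv,) appearance embedding (e.g., instrument image) - hypothetical
    action_feat  : (Da,) gesture/motion embedding (e.g., musician pose) - hypothetical
    W_a, W_v, W_m: toy projection weights standing in for the learned network
    """
    F, T = mixture_spec.shape
    # Project audio frames and both visual cues into a shared embedding space.
    audio_emb = mixture_spec.T @ W_a              # (T, D)
    cond = visual_feat @ W_v + action_feat @ W_m  # (D,) fused visual+action cue
    # Correlate each time frame with the conditioning vector, then broadcast
    # the per-frame score over frequency to form a mask in [0, 1].
    scores = sigmoid(audio_emb @ cond)            # (T,)
    mask = np.tile(scores, (F, 1))                # (F, T)
    # Masking keeps the estimate within the mixture's magnitude envelope.
    return mask * mixture_spec, mask
```

In the actual framework the mask would be predicted per time-frequency bin by a trained separator; here the per-frame scalar mask only demonstrates the data flow from the two visual cues to the spectral estimate.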