
Towards Weakly Supervised Text-to-Audio Grounding

IEEE Transactions on Multimedia (2024)

Abstract
The text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised text-to-audio grounding (WSTAG), where frame-level annotations of sound events are unavailable and only the caption of a whole audio clip can be utilized for training. WSTAG is superior to strongly-supervised approaches in its scalability to large audio-text datasets. Two WSTAG frameworks are studied in this paper: sentence-level and phrase-level. First, we analyze the limitations of the mean pooling used in the previous WSTAG approach and investigate the effects of different pooling strategies. We then propose phrase-level WSTAG, which uses matching labels between audio clips and phrases for training. Advanced negative sampling strategies and self-supervision are proposed to enhance the accuracy of the weak labels and provide pseudo strong labels. Experimental results show that our system significantly outperforms the previous WSTAG state of the art. Finally, we conduct extensive experiments to analyze the effects of several factors on phrase-level WSTAG. The code and model are available at https://github.com/wsntxxn/TextToAudioGrounding.
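The pooling strategies the abstract contrasts can be illustrated with a minimal sketch. Assuming a vector of per-frame audio-text similarity scores, the sketch below compares mean pooling (the baseline the paper analyzes) with max and linear-softmax pooling; the function and strategy names are illustrative, not the paper's actual API.

```python
import torch

def pool_frame_scores(sim: torch.Tensor, strategy: str = "mean") -> torch.Tensor:
    """Aggregate per-frame similarity scores (shape [T]) into one
    clip-level score for weakly-supervised training.

    Illustrative sketch; names are assumptions, not the paper's code.
    """
    if strategy == "mean":
        # Mean pooling: every frame contributes equally, which dilutes
        # short sound events among many irrelevant frames.
        return sim.mean()
    if strategy == "max":
        # Max pooling: only the single strongest frame receives gradient.
        return sim.max()
    if strategy == "linear_softmax":
        # Linear-softmax pooling: each frame is weighted by its own
        # score, emphasizing likely-positive frames without hard argmax.
        return (sim * sim).sum() / sim.sum().clamp(min=1e-7)
    raise ValueError(f"unknown strategy: {strategy}")
```

For a clip where a short event activates only a few frames, mean pooling yields a low clip score while max and linear-softmax pooling preserve a strong signal, which is why the choice of pooling matters under weak supervision.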
Keywords
text-to-audio grounding, weakly-supervised learning, negative sampling, audio-text representation, clustering