CLAPSep: Leveraging Contrastive Pre-trained Model for Multi-Modal Query-Conditioned Target Sound Extraction
arXiv (2024)
Abstract
Universal sound separation (USS) aims to extract arbitrary types of sounds
from real-world recordings. This can be achieved by language-queried target
sound extraction (TSE), which typically consists of two components: a query
network that converts user queries into conditional embeddings, and a
separation network that extracts the target sound accordingly. Existing methods
commonly train models from scratch. As a consequence, substantial data and
computational resources are required to improve the models' performance and
generalizability. In this paper, we propose to integrate pre-trained models
into TSE models to address the above issue. To be specific, we tailor and adapt
the powerful contrastive language-audio pre-trained model (CLAP) for USS,
denoted as CLAPSep. CLAPSep also accepts flexible user inputs, taking
positive and/or negative prompts in single or multiple modalities (text
and/or audio) for target sound extraction. These key features not only
enhance extraction performance but also broaden the versatility of its
applications. Extensive experiments on 5 diverse datasets demonstrate the
superior performance and zero- and few-shot generalizability of CLAPSep,
which converges quickly during training and surpasses previous methods by a
significant margin. The full code and some audio examples are released for
reproduction and evaluation.
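
As a concrete illustration of the query-conditioned TSE pipeline described above, the sketch below shows how positive and negative query embeddings (e.g., produced by a frozen CLAP text or audio encoder) might condition a mask-based separation network. The class name `QueryConditionedSeparator`, the tensor dimensions, and the FiLM-style conditioning are illustrative assumptions for a minimal sketch, not the actual CLAPSep architecture.

```python
import torch
import torch.nn as nn

class QueryConditionedSeparator(nn.Module):
    """Hypothetical sketch of a query-conditioned target sound extractor.

    Positive and negative query embeddings (stand-ins for frozen CLAP
    encoder outputs) modulate a mask estimator via FiLM-style scale/shift.
    """

    def __init__(self, n_freq: int = 513, embed_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        # Concatenated positive/negative embeddings are projected to
        # per-channel scale and shift parameters (FiLM conditioning).
        self.film = nn.Linear(2 * embed_dim, 2 * hidden)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mixture_spec, pos_emb, neg_emb):
        # mixture_spec: (batch, time, n_freq) magnitude spectrogram
        # pos_emb / neg_emb: (batch, embed_dim) query embeddings
        h = self.encoder(mixture_spec)
        cond = torch.cat([pos_emb, neg_emb], dim=-1)
        scale, shift = self.film(cond).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)
        mask = self.mask_head(h)       # time-frequency mask in [0, 1]
        return mixture_spec * mask     # estimated target spectrogram

# Usage with random stand-ins for CLAP embeddings:
sep = QueryConditionedSeparator()
mix = torch.rand(2, 100, 513)
pos, neg = torch.randn(2, 512), torch.randn(2, 512)
print(sep(mix, pos, neg).shape)  # torch.Size([2, 100, 513])
```

Conditioning on both a positive and a negative embedding lets the separator be told what to keep and what to suppress; in practice the embeddings would come from the pre-trained CLAP encoders rather than random tensors.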