A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds
ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
Abstract
Recently, there has been an increasing focus on audio-text cross-modal
learning. However, most of the existing audio-text datasets contain only simple
descriptions of sound events. Such descriptions therefore offer only a
limited advantage over classification labels. In this paper, we
first analyze the detailed information that human descriptions of audio may
contain beyond sound event labels. Based on the analysis, we propose an
automatic pipeline for curating audio-text pairs with rich details. Leveraging
the property that sounds can be mixed and concatenated in the time domain, we
simulate audio mixtures while controlling details in four aspects: temporal
relationship, loudness, speaker identity, and occurrence number. These
details are then transformed into captions by large language models. Audio-text
pairs with rich details in text descriptions are thereby obtained. We validate
the effectiveness of our pipeline with a small amount of simulated data,
demonstrating that the simulated data enables models to learn detailed audio
captioning.
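The simulation step described above, mixing single-event sounds in the time domain while tracking the details that later become captions, can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the metadata fields, and the use of synthetic sine tones as stand-in single-event sounds are all assumptions made for the example.

```python
import numpy as np

SR = 16000  # sample rate in Hz (assumption; the paper does not specify one)

def simulate_mixture(events, total_dur=10.0, sr=SR):
    """Mix single-event sounds into one clip while recording the details
    (onset/offset times, loudness gain) that a caption generator would use.

    `events` is a list of dicts with keys:
      "label"   - sound event name
      "audio"   - 1-D float32 waveform of the single event
      "onset"   - start time in seconds within the mixture
      "gain_db" - loudness adjustment in decibels
    """
    mix = np.zeros(int(total_dur * sr), dtype=np.float32)
    metadata = []
    for ev in events:
        start = int(ev["onset"] * sr)
        gain = 10.0 ** (ev["gain_db"] / 20.0)  # dB -> linear amplitude
        seg = ev["audio"][: len(mix) - start] * gain  # truncate at clip end
        mix[start : start + len(seg)] += seg  # overlay in the time domain
        metadata.append({
            "label": ev["label"],
            "onset": ev["onset"],
            "offset": ev["onset"] + len(seg) / sr,
            "gain_db": ev["gain_db"],
        })
    return mix, metadata

def describe_temporal_relation(a, b):
    """Derive a temporal-relationship detail between two event metadata
    entries -- one of the four controlled aspects. A real pipeline would
    feed such details to a large language model to produce the caption."""
    if a["offset"] <= b["onset"]:
        return f'{a["label"]} happens before {b["label"]}'
    return f'{a["label"]} overlaps with {b["label"]}'
```

A short usage example with two synthetic one-second tones standing in for single-event recordings:

```python
tone = lambda f, d: np.sin(2 * np.pi * f * np.arange(int(d * SR)) / SR).astype(np.float32)
events = [
    {"label": "dog bark", "audio": tone(440.0, 1.0), "onset": 0.0, "gain_db": 0.0},
    {"label": "car horn", "audio": tone(880.0, 1.0), "onset": 2.0, "gain_db": -6.0},
]
mix, meta = simulate_mixture(events)
```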
Keywords
Detailed audio captioning, audio-text learning, data curation pipeline