VoxMM: Rich Transcription of Conversations in the Wild

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Abstract
This paper presents a multi-modal dataset that contains rich transcriptions of spoken conversations. As diverse multi-modal and multi-task models emerge, there is a growing need for multi-modal training and evaluation datasets accompanied by rich metadata. However, no universal dataset addresses these requirements across such diverse tasks, partly due to the cost of annotation. To overcome this limitation, we develop a semi-automatic pipeline that makes annotation more feasible. The resulting dataset is VoxMM, a multi-modal, multi-domain dataset. VoxMM incorporates video, audio, and text modalities. In terms of labels, it offers a wide array of metadata such as speaker labels, transcriptions, and gender. VoxMM supports both the training and the evaluation of any-to-any modality mapping models. It also offers a more accurate representation of real-world scenarios, bridging the gap between controlled laboratory experiments and the varying performance observed in real-world conditions. We present initial benchmarks on automatic speech recognition and speaker diarisation. The VoxMM dataset can be downloaded from https://mm.kaist.ac.kr/projects/voxmm
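
The abstract does not specify how the annotations are stored, but as a rough illustration, the sketch below shows one way the per-utterance metadata listed above (speaker label, transcription, gender, plus timestamps) could be represented and loaded in Python. The field names, JSON-lines layout, and file name are assumptions for illustration only, not the dataset's actual schema.

    # Hypothetical sketch only: VoxMM's real annotation format is not described in this abstract.
    import json
    from dataclasses import dataclass

    @dataclass
    class Utterance:
        video_id: str   # identifier of the source video clip (assumed field)
        speaker: str    # speaker label within the conversation
        gender: str     # speaker gender metadata
        start: float    # utterance start time in seconds (assumed field)
        end: float      # utterance end time in seconds (assumed field)
        text: str       # verbatim transcription

    def load_annotations(path: str) -> list[Utterance]:
        """Read a (hypothetical) JSON-lines annotation file into Utterance records."""
        with open(path, encoding="utf-8") as f:
            return [Utterance(**json.loads(line)) for line in f]

    # Example usage with an assumed file name:
    # utterances = load_annotations("voxmm_annotations.jsonl")
    # print(utterances[0].speaker, utterances[0].text)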
Keywords
Audio-Visual, Dataset, Speech Recognition, Speaker Diarisation, Speaker Recognition