VoxMM: Rich Transcription of Conversations in the Wild

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Abstract
This paper presents a multi-modal dataset that contains rich transcriptions of spoken conversations. As diverse multi-modal and multi-task models emerge, there is a growing need for multi-modal training and evaluation datasets accompanied by rich metadata. However, no universal dataset addresses these requirements across such diverse tasks, partly due to the cost of annotation. To overcome this limitation, we develop a semi-automatic pipeline that makes annotation more feasible. The resulting dataset is VoxMM, a multi-modal, multi-domain dataset. VoxMM incorporates video, audio, and text modalities. In terms of labels, it offers a wide array of metadata such as speaker labels, transcriptions, and gender. VoxMM supports both the training and the evaluation of any-to-any modality mapping models. It also offers a more accurate representation of real-world scenarios, bridging the gap between controlled laboratory experiments and the varying performance observed in real-world conditions. We present initial benchmarks on automatic speech recognition and speaker diarisation. The VoxMM dataset can be downloaded from https://mm.kaist.ac.kr/projects/voxmm
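
The abstract does not specify how the annotations are stored, but as a rough illustration, the sketch below shows one way the per-utterance metadata listed above (speaker label, transcription, gender, plus timestamps) could be represented and loaded in Python. The field names, JSON-lines layout, and file name are assumptions for illustration only, not the dataset's actual schema.

    # Hypothetical sketch only: VoxMM's real annotation format is not described in this abstract.
    import json
    from dataclasses import dataclass

    @dataclass
    class Utterance:
        video_id: str   # identifier of the source video clip (assumed field)
        speaker: str    # speaker label within the conversation
        gender: str     # speaker gender metadata
        start: float    # utterance start time in seconds (assumed field)
        end: float      # utterance end time in seconds (assumed field)
        text: str       # verbatim transcription

    def load_annotations(path: str) -> list[Utterance]:
        """Read a (hypothetical) JSON-lines annotation file into Utterance records."""
        with open(path, encoding="utf-8") as f:
            return [Utterance(**json.loads(line)) for line in f]

    # Example usage with an assumed file name:
    # utterances = load_annotations("voxmm_annotations.jsonl")
    # print(utterances[0].speaker, utterances[0].text)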
Keywords
Audio-Visual, Dataset, Speech Recognition, Speaker Diarisation, Speaker Recognition