Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System

INTERSPEECH(2019)

引用 19|浏览3
暂无评分
摘要
Automatic sung speech recognition is a relatively understudied topic that has been held back by a lack of large and freely available datasets. This has recently changed thanks to the release of the DAMP Sing! dataset, a 1100 hour karaoke dataset originating from the social music-making company, Smule. This paper presents work undertaken to define an easily replicable, automatic speech recognition benchmark for this data. In particular, we describe how transcripts and alignments have been recovered from Karaoke prompts and timings; how suitable training, development and test sets have been defined with varying degrees of accent variability; and how language models have been developed using lyric data from the LyricWikia website. Initial recognition experiments have been performed using factored-layer TDNN acoustic models with lattice-free MMI training using Kaldi. The best WER is 19.60% - a new state-of-the-art for this type of data. The paper concludes with a discussion of the many challenging problems that remain to be solved. Dataset definitions and Kaldi scripts have been made available so that the benchmark is easily replicable.
更多
查看译文
关键词
Lyrics, Singing, Speech Recognition, Lyrics Transcription, DAMP
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要