Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
arXiv (Cornell University) (2023)
Abstract
In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting
of 50,000 hours of read English speech derived from LibriVox. To the best of
our knowledge, Libriheavy is the largest freely-available corpus of speech with
supervisions. Unlike other open-source datasets that provide only
normalized transcriptions, Libriheavy contains richer information such as
punctuation, casing and text context, which gives more flexibility for system
building. Specifically, we propose a general and efficient pipeline to locate,
align and segment the audio in the previously published Librilight with its
corresponding texts. Like Librilight, Libriheavy has three training
subsets, small, medium and large, of 500, 5,000 and 50,000 hours respectively. We
also extract the dev and test evaluation sets from the aligned audio and
guarantee that they share no speakers or books with the training sets. Baseline
systems are built on the popular CTC-Attention and transducer models.
Additionally, we open-source our dataset creation pipeline, which can also be
used for other audio alignment tasks.
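
The guarantee that evaluation sets share no speakers or books with the training sets can be illustrated with a small check. This is a minimal sketch, not the authors' pipeline: the segment records and the field names `speaker` and `book` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

Segment = Dict[str, str]  # hypothetical record: {"speaker": ..., "book": ...}


def find_overlap(
    train: List[Segment], heldout: List[Segment]
) -> Tuple[Set[str], Set[str]]:
    """Return the speakers and the books that appear in both collections."""
    train_speakers = {seg["speaker"] for seg in train}
    train_books = {seg["book"] for seg in train}
    heldout_speakers = {seg["speaker"] for seg in heldout}
    heldout_books = {seg["book"] for seg in heldout}
    return train_speakers & heldout_speakers, train_books & heldout_books


# Toy data (made up for this example, not taken from the corpus).
train_segments = [
    {"speaker": "spk_a", "book": "book_1"},
    {"speaker": "spk_b", "book": "book_2"},
]
test_segments = [
    {"speaker": "spk_c", "book": "book_3"},
]
leaky_segments = [
    {"speaker": "spk_a", "book": "book_4"},
]

clean_speakers, clean_books = find_overlap(train_segments, test_segments)
leaky_speakers, leaky_books = find_overlap(train_segments, leaky_segments)
```

A split satisfies the constraint only when both returned sets are empty; in the toy data, `leaky_segments` reuses a training speaker and would be rejected.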
Key words
hours ASR corpus, punctuation casing, Libriheavy, context