Exploring the limits of decoder-only models trained on public speech recognition corpora
CoRR (2024)
Abstract
The emergence of industrial-scale speech recognition (ASR) models such as
Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of
audio-only proprietary data respectively, has led to a stronger need for
large-scale public ASR corpora and competitive open-source pipelines. Unlike
these models, large language models are typically based on Transformer
decoders, and it remains unclear whether decoder-only models trained on public
data alone can deliver competitive performance. In this work, we investigate
factors such as the choice of training datasets and the modeling components
necessary for obtaining the best performance using public English ASR corpora
alone. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively
outperforms the encoder-decoder open-source replication of Whisper (OWSM) on
nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of
15 test sets. We release our codebase and model checkpoints under a permissive
license.
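
To make the decoder-only framing concrete, here is a minimal PyTorch sketch of
the general idea: audio frames are projected into the token embedding space and
used as a prefix, after which text tokens are predicted autoregressively by a
single causal self-attention stack. This is not the paper's released DOTA code;
the class, dimensions, and parameter names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DecoderOnlyASR(nn.Module):
        # Hypothetical decoder-only ASR sketch, not the authors' DOTA model.
        def __init__(self, n_mels=80, vocab_size=1000, d_model=256,
                     n_heads=4, n_layers=4):
            super().__init__()
            self.audio_proj = nn.Linear(n_mels, d_model)      # log-mel frames -> model space
            self.tok_emb = nn.Embedding(vocab_size, d_model)  # text token embeddings
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)  # self-attention only
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, audio, tokens):
            # audio: (B, T_a, n_mels); tokens: (B, T_t) previously emitted text tokens.
            # Positional encodings are omitted for brevity.
            x = torch.cat([self.audio_proj(audio), self.tok_emb(tokens)], dim=1)
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            h = self.blocks(x, mask=mask)              # causal mask over the joint sequence
            return self.lm_head(h[:, audio.size(1):])  # next-token logits at text positions

    model = DecoderOnlyASR()
    logits = model(torch.randn(2, 50, 80), torch.randint(0, 1000, (2, 10)))
    print(logits.shape)  # torch.Size([2, 10, 1000])

The contrast with Whisper-style systems is that there is no separate audio
encoder or cross-attention: one causal Transformer consumes the audio prefix
and the transcript as a single sequence.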