OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
CoRR (2024)
Abstract
There has been an increasing interest in large speech models that can perform
multiple speech processing tasks in a single model. Such models usually adopt
the encoder-decoder or decoder-only architecture due to their popularity and
good performance in many domains. However, autoregressive models can be slower
during inference compared to non-autoregressive models and also have potential
risks of hallucination. Though prior studies observed promising results of
non-autoregressive models for certain tasks at small scales, it remains unclear
if they can be scaled to speech-to-text generation in diverse languages and
tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we
propose OWSM-CTC, a novel encoder-only speech foundation model based on
Connectionist Temporal Classification (CTC). It is trained on 180k hours of
public audio data for multilingual automatic speech recognition (ASR), speech
translation (ST), and language identification (LID). Compared to
encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up
to 25% relative improvement on ST, while it is more robust and 3 to 4 times
faster for inference. OWSM-CTC also improves the long-form ASR result with 20x
speed-up. We will publicly release our codebase, pre-trained model, and
training logs to promote open science in speech foundation models.
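The speed advantage claimed above comes from non-autoregressive decoding: a CTC model emits a label (or blank) for every encoder frame in a single forward pass, and the output is recovered by collapsing repeats and removing blanks rather than generating tokens one by one. A minimal sketch of that standard CTC greedy-decoding rule (illustrative only; not the paper's released code, and the frame IDs below are made up):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then drop blanks (standard CTC rule)."""
    out = []
    prev = None
    for label in frame_ids:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Hypothetical frame-level argmax IDs over 10 encoder frames:
print(ctc_greedy_decode([0, 3, 3, 0, 0, 5, 5, 5, 0, 7]))  # [3, 5, 7]
```

Because every frame is decoded in parallel and no token depends on previously generated ones, inference cost grows with audio length only, which also limits the repetition-style hallucinations that autoregressive decoders can produce.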