Improving RNN-T ASR Accuracy Using Untranscribed Context Audio

CoRR (2020)

Abstract
We present a new training scheme for streaming automatic speech recognition (ASR) based on recurrent neural network transducers (RNN-T) which allows the encoder network to benefit from longer audio streams as input, while only requiring partial transcriptions of such streams during training. We show that this extension of the acoustic context during training and inference can lead to word error rate reductions of more than 6% in a realistic production setting. We investigate its effect on acoustically challenging data containing background speech and present data points which indicate that this approach helps the network learn both speaker and environment adaptation. Finally, we visualize RNN-T loss gradients with respect to the input features in order to illustrate the ability of a long short-term memory (LSTM) based ASR encoder to exploit long-term context.
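The core idea described in the abstract can be illustrated with a minimal sketch (not the authors' implementation): a unidirectional LSTM encoder consumes the full audio stream, including untranscribed context, while the RNN-T loss is computed only over the encoder frames aligned with the transcribed segment. All module sizes, the frame-selection scheme, and the use of torchaudio.functional.rnnt_loss are illustrative assumptions.

```python
# Sketch: train an RNN-T on a long stream, computing the transducer loss
# only on the transcribed segment. Shapes and hyperparameters are made up.
import torch
import torch.nn as nn
import torchaudio

VOCAB, BLANK, FEAT, HID = 32, 0, 80, 256

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FEAT, HID, num_layers=2, batch_first=True)
        self.proj = nn.Linear(HID, HID)
    def forward(self, x):
        h, _ = self.lstm(x)  # unidirectional: frame t sees all frames <= t
        return self.proj(h)

class Predictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.lstm = nn.LSTM(HID, HID, batch_first=True)
    def forward(self, y):
        h, _ = self.lstm(self.embed(y))
        return h

class Joiner(nn.Module):
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(HID, VOCAB)
    def forward(self, enc, pred):
        # enc: (B, T, H), pred: (B, U+1, H) -> logits: (B, T, U+1, V)
        return self.out(torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1)))

encoder, predictor, joiner = Encoder(), Predictor(), Joiner()

# One long stream of 800 feature frames; only frames 500..800 carry a
# transcription, frames 0..500 are untranscribed context audio.
features = torch.randn(1, 800, FEAT)
seg_start, seg_end = 500, 800
targets = torch.randint(1, VOCAB, (1, 20), dtype=torch.int32)

enc_full = encoder(features)              # context flows through LSTM state
enc_seg = enc_full[:, seg_start:seg_end]  # loss restricted to labeled span

sos = torch.full((1, 1), BLANK, dtype=torch.long)  # start-of-sequence token
pred = predictor(torch.cat([sos, targets.long()], dim=1))
logits = joiner(enc_seg, pred)

loss = torchaudio.functional.rnnt_loss(
    logits,
    targets,
    logit_lengths=torch.tensor([seg_end - seg_start], dtype=torch.int32),
    target_lengths=torch.tensor([targets.size(1)], dtype=torch.int32),
    blank=BLANK,
)
loss.backward()  # gradients also flow into the untranscribed context frames
```

Because the encoder state at the start of the transcribed segment depends on the preceding context frames, backpropagation through the LSTM lets those unlabeled frames shape the encoder, which is consistent with the speaker- and environment-adaptation effects the abstract reports.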
Keywords
untranscribed context audio, accuracy