
Alignment-Learning Based Single-Step Decoding for Accurate and Fast Non-Autoregressive Speech Recognition.

IEEE International Conference on Acoustics, Speech, and Signal Processing (2022)

Abstract
Non-autoregressive transformer (NAT) based speech recognition models have gained increasing attention because they achieve faster inference than their autoregressive counterparts, especially when single-step decoding is applied. However, single-step decoding with length prediction suffers from decoding instability and yields only limited inference-speed gains. To address this, we propose an alignment-learning-based NAT model, named AL-NAT. Our idea is inspired by the fact that the encoder CTC output and the target sequence are monotonically related. Specifically, we design an alignment cost matrix between the CTC output tokens and the target tokens and define a novel alignment loss that minimizes the distance between the alignment cost matrix and the ground-truth monotonic alignment path. By eliminating the length prediction mechanism, our AL-NAT model achieves remarkable improvements in recognition accuracy and decoding speed. To learn contextual knowledge and further improve decoding accuracy, we add a lightweight language model on both the encoder and decoder sides. Our proposed method achieves WERs of 2.8%/6.3% and an RTF of 0.011 on the Librispeech test clean/other sets with a lightweight 3-gram LM, and a CER of 5.3% and an RTF of 0.005 on Aishell-1 without an LM.
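The core idea in the abstract — penalizing the distance between an alignment cost matrix and a monotonic alignment path — can be illustrated with a minimal sketch. The paper's exact cost definition and path construction are not given here, so the pairwise-similarity cost, the linear monotonic path, and the L1 distance below are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def monotonic_path(T, U):
    # Hypothetical ground-truth path: map each of T encoder frames
    # monotonically onto one of U target tokens along the diagonal.
    path = np.zeros((T, U))
    idx = np.minimum(np.arange(T) * U // T, U - 1)
    path[np.arange(T), idx] = 1.0
    return path

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alignment_loss(ctc_feats, tgt_embs):
    # ctc_feats: (T, D) CTC-side encoder features; tgt_embs: (U, D)
    # target-token embeddings. Cost matrix = row-softmaxed similarity
    # (an assumption); loss = L1 distance to the monotonic path.
    cost = softmax(ctc_feats @ tgt_embs.T)          # (T, U)
    path = monotonic_path(*cost.shape)
    return np.abs(cost - path).mean()
```

Minimizing this loss pushes each encoder frame's cost row toward a one-hot distribution on its monotonically aligned target token, which is what lets the decoder dispense with an explicit length predictor.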
Keywords
Non-autoregressive transformer, Speech recognition, Alignment learning