Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition.

INTERSPEECH (2020)

Abstract
In this paper, we propose an utterance invariant training (UIT) scheme specifically designed to improve the performance of a two-pass end-to-end hybrid ASR system. Our hybrid ASR solution uses a shared encoder with a monotonic chunkwise attention (MoChA) decoder for streaming capability, together with a low-latency bidirectional full-attention (BFA) decoder that enhances overall ASR accuracy. A modified sequence summary network (SSN) based utterance invariant training is adapted to this two-pass architecture: the input feature stream is self-conditioned by scaling and shifting with its own sequence summary, and this conditioned stream is then used as concatenative conditioning on the bidirectional encoder layers sitting on top of the shared encoder. In effect, the proposed utterance invariant training combines three types of conditioning: concatenative, multiplicative, and additive. Experimental results show that the proposed approach reduces word error rates by up to 7% relative on LibriSpeech, and by 10-15% relative on a large-scale Korean end-to-end two-pass hybrid ASR model.
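The combined conditioning described above (multiplicative scale, additive shift, and concatenation of the sequence summary) can be sketched as follows. This is a minimal NumPy illustration under assumed toy dimensions; the layer shapes, nonlinearity, and summary pooling are hypothetical and not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 50, 40   # frames, feature dimension (illustrative sizes)
H = 16          # summary dimension (assumed)
x = rng.standard_normal((T, D))   # input feature stream

# Sequence summary network (SSN): per-frame projection, then mean over time.
W_s = rng.standard_normal((D, H)) * 0.1
summary = np.tanh(x @ W_s).mean(axis=0)   # (H,)

# Map the summary to per-feature scale and shift
# (multiplicative and additive conditioning).
W_g = rng.standard_normal((H, D)) * 0.1
W_b = rng.standard_normal((H, D)) * 0.1
scale = 1.0 + summary @ W_g   # (D,)
shift = summary @ W_b         # (D,)

# Self-condition the features, then concatenate the summary to every frame
# (concatenative conditioning) as input to the upper bidirectional encoder.
x_cond = x * scale + shift                                        # (T, D)
encoder_in = np.concatenate([x_cond, np.tile(summary, (T, 1))], axis=1)
print(encoder_in.shape)   # (50, 56)
```

In a trained system the projections would be learned jointly with the encoder; here they are random placeholders to show only the tensor flow of the three conditioning types.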
Keywords
speech recognition, ASR, utterance invariant training, sequence summary network, two-pass ASR, streaming ASR