Multi-Speaker Sequence-To-Sequence Speech Synthesis For Data Augmentation In Acoustic-To-Word Speech Recognition

2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract
Acoustic-to-word (A2W) automatic speech recognition (ASR) realizes very fast decoding with a simple architecture and achieves state-of-the-art performance. However, the A2W model suffers from the out-of-vocabulary (OOV) word problem and cannot use text-only data to improve its language modeling capability. Meanwhile, sequence-to-sequence neural speech synthesis has also been developed and has achieved naturalness comparable to human speech. We investigate leveraging sequence-to-sequence neural speech synthesis to augment training data for an ASR system in a target domain. While a speech synthesis model is usually trained with single-speaker data, ASR needs to cover a variety of speakers. In this work, we extend the speech synthesizer so that it can output speech of many speakers. The multi-speaker speech synthesizer is trained with a large corpus in the source domain and then used to generate acoustic features from texts of the target domain. These synthesized speech features are combined with real speech features of the source domain to train an attention-based A2W model. Experimental results show that the A2W model trained with data from the multi-speaker synthesizer achieved a significant improvement over both the baseline and the single-speaker model.
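The following Python sketch illustrates the augmentation pipeline the abstract describes: a multi-speaker synthesizer produces acoustic features for target-domain text under several speaker identities, and the synthetic pairs are merged with real source-domain data before A2W training. All names here (MultiSpeakerSynthesizer, augment_training_set, the feature dimensions) are hypothetical placeholders, not the authors' implementation; a real system would use a trained Tacotron-style seq2seq synthesizer emitting log-mel features.

```python
# Minimal sketch of TTS-based data augmentation for A2W ASR training.
# Hypothetical stand-in code: the synthesizer below emits random frames
# where a real seq2seq TTS decoder would run.

import numpy as np

rng = np.random.default_rng(0)

class MultiSpeakerSynthesizer:
    """Placeholder for a seq2seq TTS model conditioned on a speaker embedding."""
    def __init__(self, num_speakers: int, feat_dim: int = 80):
        self.num_speakers = num_speakers
        self.feat_dim = feat_dim
        # One embedding per speaker seen in the source-domain TTS corpus.
        self.speaker_embeddings = rng.normal(size=(num_speakers, 64))

    def synthesize(self, text: str, speaker_id: int) -> np.ndarray:
        """Return a (frames, feat_dim) acoustic feature matrix for `text`."""
        n_frames = 10 * max(len(text.split()), 1)  # rough length heuristic
        return rng.normal(size=(n_frames, self.feat_dim))

def augment_training_set(real_data, target_texts, synth, speakers_per_text=4):
    """Pair each target-domain sentence with features from several randomly
    chosen synthetic speakers, then merge with the real source-domain data."""
    augmented = list(real_data)
    for text in target_texts:
        for spk in rng.choice(synth.num_speakers, size=speakers_per_text,
                              replace=False):
            augmented.append((synth.synthesize(text, int(spk)), text))
    return augmented

# Usage: real (features, transcript) pairs from the source domain plus
# text-only sentences from the target domain.
real_data = [(rng.normal(size=(120, 80)), "hello world")]
target_texts = ["turn the volume up", "what is the weather"]
synth = MultiSpeakerSynthesizer(num_speakers=100)
train_set = augment_training_set(real_data, target_texts, synth)
print(len(train_set))  # 1 real pair + 2 texts x 4 speakers = 9 pairs
```

Sampling multiple speaker identities per sentence is the point of the multi-speaker extension: it gives the A2W model acoustic variety for the new vocabulary, which single-speaker synthesis cannot provide.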
Keywords
Sequence-to-sequence speech recognition, sequence-to-sequence speech synthesis, acoustic-to-word model, training data augmentation, multi-speaker speech synthesis