Synthesizing waveform sequence-to-sequence to augment training data for sequence-to-sequence speech recognition

ACOUSTICAL SCIENCE AND TECHNOLOGY (2021)

Abstract
Sequence-to-sequence (seq2seq) automatic speech recognition (ASR) has recently achieved state-of-the-art performance with fast decoding and a simple architecture. On the other hand, it requires a large amount of training data and cannot exploit text-only data for training. In our previous work, we proposed a method for applying text data to seq2seq ASR training by leveraging text-to-speech (TTS). However, we observe that the log Mel-scale filterbank (lmfb) features produced by a Tacotron 2-based model are blurry, particularly along the time dimension. This problem is mitigated by introducing a WaveNet vocoder, which generates speech of better quality, or spectrograms with better time resolution. This makes it possible to train a waveform-input end-to-end ASR model. Here we use CNN filters and apply a masking method similar to SpecAugment. We compare the waveform-input model with two kinds of lmfb-input models: (1) lmfb features directly generated by TTS, and (2) lmfb features converted from the waveform generated by TTS. Experimental evaluations show that the combination of waveform-output TTS and the waveform-input end-to-end ASR model outperforms the lmfb-input models in two domain adaptation settings.
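To make the compared pipelines concrete, the sketch below illustrates setting (2) of the abstract: converting a (possibly TTS-generated) waveform into lmfb features and then applying SpecAugment-style time/frequency masking. This is not the authors' code; the frame/hop sizes, mel dimension, and mask counts/widths are common illustrative defaults, and the paper only states that a masking method *similar to* SpecAugment is used. The example assumes NumPy and librosa are available.

```python
import numpy as np
import librosa


def waveform_to_lmfb(waveform, sr=16000, n_fft=400, hop_length=160, n_mels=80):
    """Convert a waveform (e.g., WaveNet vocoder output) to log Mel filterbank features.

    25 ms frames / 10 ms hop at 16 kHz and 80 mel bins are common ASR defaults,
    not values taken from the paper.
    """
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Small floor avoids log(0); transpose to (frames, n_mels).
    return np.log(mel + 1e-10).T


def spec_augment(features, num_time_masks=2, max_time_width=40,
                 num_freq_masks=2, max_freq_width=15, rng=None):
    """SpecAugment-style masking: zero out random time and frequency bands.

    Mask counts and widths here are hypothetical illustration values.
    """
    if rng is None:
        rng = np.random.default_rng()
    out = features.copy()
    n_frames, n_mels = out.shape
    for _ in range(num_time_masks):
        w = rng.integers(0, max_time_width + 1)
        t0 = rng.integers(0, max(1, n_frames - w))
        out[t0:t0 + w, :] = 0.0  # mask a contiguous block of frames
    for _ in range(num_freq_masks):
        w = rng.integers(0, max_freq_width + 1)
        f0 = rng.integers(0, max(1, n_mels - w))
        out[:, f0:f0 + w] = 0.0  # mask a contiguous band of mel channels
    return out


# Usage on a placeholder waveform (1 s of noise standing in for TTS output).
sr = 16000
wav = np.random.default_rng(0).standard_normal(sr).astype(np.float32)
lmfb = waveform_to_lmfb(wav, sr=sr)      # shape: (~101, 80)
augmented = spec_augment(lmfb)           # masked features for ASR training
```

In the waveform-input model of the paper, the fixed mel filterbank stage is instead replaced by learnable CNN filters, with an analogous masking applied.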
Keywords
Speech recognition, Sequence-to-sequence model, Attention-based encoder-decoder model, Speech synthesis, Data augmentation