
Data Augmentation for ASR Using TTS via a Discrete Representation

Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021

Abstract
While end-to-end automatic speech recognition (ASR) has achieved high performance, it requires a huge amount of paired speech and transcription data for training. Recently, data augmentation methods have been actively investigated. One method is to use a text-to-speech (TTS) system to generate speech data from text-only data and use the generated speech for data augmentation, but it has been found that the synthesized log Mel-scale filterbank (lmfb) features can have a serious mismatch with the real speech features. In this study, we propose a data augmentation method via a discrete speech representation. The TTS model predicts discrete ID sequences instead of lmfb features, and the ASR model also uses the ID sequences as training data. We expect that the use of a discrete representation based on vq-wav2vec not only makes TTS training easier but also mitigates the mismatch with real data. Experimental evaluations show that the proposed method outperforms the data augmentation method using the conventional TTS. We found that it reduces speaker dependency, and the generated features are distributed more closely to the real ones.
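
The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: vq-wav2vec learns its encoder and codebook from raw audio, whereas this toy uses a hand-written codebook with nearest-neighbour lookup, and the names `quantize`, `build_asr_training_set`, and `toy_tts_predict_ids` are hypothetical placeholders. The point is only that real speech and TTS output end up in the same discrete ID space before ASR training.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]
IdSeq = List[int]


def quantize(frames: Sequence[Frame], codebook: Sequence[Frame]) -> IdSeq:
    """Map each frame to the index of its nearest codebook vector (squared L2).

    Assumption: a stand-in for a learned quantizer such as vq-wav2vec.
    """
    ids = []
    for frame in frames:
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, code)) for code in codebook
        ]
        ids.append(distances.index(min(distances)))
    return ids


def build_asr_training_set(
    paired: Sequence[Tuple[Sequence[Frame], str]],
    text_only: Sequence[str],
    codebook: Sequence[Frame],
    tts_predict_ids: Callable[[str], IdSeq],
) -> List[Tuple[IdSeq, str]]:
    """Combine real and synthesized examples as (ID sequence, transcript) pairs."""
    # Real speech: quantize the frames into discrete IDs.
    real = [(quantize(frames, codebook), text) for frames, text in paired]
    # Text-only data: the TTS model predicts IDs directly, with no lmfb features.
    synthetic = [(tts_predict_ids(text), text) for text in text_only]
    return real + synthetic


def toy_tts_predict_ids(text: str) -> IdSeq:
    """Hypothetical TTS stub: one ID per word, for illustration only."""
    return [len(word) % 3 for word in text.split()]


codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
paired = [([(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)], "good morning")]
text_only = ["hello world"]

training_set = build_asr_training_set(
    paired, text_only, codebook, toy_tts_predict_ids
)
```

In this sketch the ASR model would consume `training_set` without needing to know which examples came from real speech and which from the TTS model, since both are plain ID sequences.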
Keywords
Speech recognition, Sequence-to-sequence model, Data augmentation, vq-wav2vec, Speech synthesis