StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis
CoRR (2023)
Abstract
The expressive quality of synthesized speech for audiobooks is limited by a
generalized model architecture and an unbalanced style distribution in the
training data. To address these issues, in this paper we propose a
self-supervised style enhancing method with VQ-VAE-based pre-training for
expressive audiobook speech synthesis. Firstly, a text style encoder is
pre-trained with a large amount of unlabeled text-only data. Secondly, a
spectrogram style extractor based on VQ-VAE is pre-trained in a self-supervised
manner, with plenty of audio data that covers complex style variations. Then a
novel architecture with two encoder-decoder paths is specially designed to
model the pronunciation and high-level style expressiveness respectively, with
the guidance of the style extractor. Both objective and subjective evaluations
demonstrate that our proposed method effectively improves the naturalness
and expressiveness of synthesized speech in audiobook synthesis, especially
for role (character) and out-of-domain scenarios.
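The abstract's central component is a VQ-VAE-based style extractor, which discretizes a continuous spectrogram-derived style representation against a learned codebook. A minimal sketch of that vector-quantization step is below; this is an illustration under stated assumptions, not the paper's implementation, and the function names, codebook, and dimensions are all hypothetical.

```python
# Minimal sketch (assumption, not the paper's code) of the VQ step inside a
# VQ-VAE style extractor: a continuous style vector is mapped to its nearest
# codebook entry, yielding a discrete style token.

def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`
    under squared Euclidean distance."""
    best_i, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((v - c) ** 2 for v, c in zip(vector, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, codebook[best_i]

# Toy usage: a 2-D "style" vector snaps to the closest of three codes.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
idx, code = quantize((0.9, 1.2), codebook)
# idx == 1, code == (1.0, 1.0)
```

In a full VQ-VAE, this nearest-neighbor lookup sits between encoder and decoder, with gradients passed through the quantizer (e.g. straight-through estimation) and the codebook updated toward the encoder outputs; the sketch shows only the discretization itself.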
Keywords
expressive speech synthesis, self-supervised style enhancing, VQ-VAE, pre-training