SoftSpeech: Unsupervised Duration Model in FastSpeech 2

Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yuchao Zhang

Conference of the International Speech Communication Association (INTERSPEECH), 2022

Abstract
In this paper, we propose SoftSpeech, a neural Text-To-Speech (TTS) system that employs a novel soft-length-regulated, duration-attention-based decoder. It learns the mapping from encoder output to decoder output jointly with an unsupervised duration model (Soft-LengthRegulator), without requiring external duration information. The Soft-LengthRegulator consists of a Feed-Forward Transformer (FFT) block with Conditional Layer Normalization (CLN), followed by a learned upsampling layer with multi-head attention under a guided multi-head attention constraint; it is integrated into each decoder layer and achieves faster training convergence and better naturalness within the FastSpeech 2 framework. Soft Dynamic Time Warping (Soft-DTW) is adopted to compute the spectrogram loss when the predicted and target lengths do not match. Moreover, a fine-grained style Variational AutoEncoder (VAE) is designed to further improve the naturalness of the synthesized speech. Experiments show that SoftSpeech outperforms FastSpeech 2 in subjective tests and can be successfully applied to low-resource minority languages.
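
The abstract states that Soft-DTW is used to score predicted spectrograms against references of a different length. As a rough illustration of how such a discrepancy is computed, below is a minimal NumPy sketch of the generic Soft-DTW recursion (a soft minimum over the three dynamic-programming predecessors); the squared-Euclidean frame cost, the smoothing parameter gamma, and all function names are assumptions for the example, not the paper's implementation.

```python
# Minimal sketch of the generic Soft-DTW recursion (Cuturi & Blondel style),
# not the paper's actual loss code. Frame cost and gamma are assumptions.
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma)))."""
    z = -np.asarray(values) / gamma
    zmax = z.max()  # stabilize the log-sum-exp
    return -gamma * (zmax + np.log(np.exp(z - zmax).sum()))

def soft_dtw(pred, target, gamma=0.1):
    """Soft-DTW discrepancy between two mel-spectrograms.

    pred:   (T1, n_mels) predicted frames
    target: (T2, n_mels) reference frames (lengths may differ)
    """
    t1, t2 = pred.shape[0], target.shape[0]
    # Pairwise squared-Euclidean distances between frames.
    cost = ((pred[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    # DP table with an extra border row/column; r[0, 0] is the start state.
    r = np.full((t1 + 1, t2 + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[t1, t2]

# Example: score a 50-frame prediction against a 60-frame reference.
loss = soft_dtw(np.random.randn(50, 80), np.random.randn(60, 80))
```

This scalar version only shows the recursion; a training loss would use a batched, differentiable implementation (e.g., computed along anti-diagonals on GPU), and as gamma approaches zero the value approaches classic DTW alignment cost.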
Keywords
TTS, Soft-DTW, Attention, FastSpeech, VAE