ProsodySpeech: Towards Advanced Prosody Model for Neural Text-to-Speech

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

Abstract
This paper proposes ProsodySpeech, a novel prosody model that enhances encoder-decoder neural Text-To-Speech (TTS) to generate highly expressive and personalized speech, even with very limited training data. First, we use a Prosody Extractor, built from a large speech corpus covering many speakers, to generate a set of prosody exemplars from multiple reference utterances; Mutual Information based Style content separation (MIST) is adopted to alleviate the "content leakage" problem. Second, we use a Prosody Distributor to make a soft selection of appropriate prosody exemplars at the phone level with the help of an attention mechanism. The resulting prosody feature is then aggregated into the output of the text encoder, together with an additional phone-level pitch feature, to enrich the prosody. We apply this method to two tasks: highly expressive multi-style/emotion TTS and few-shot personalized TTS. The experiments show that the proposed model outperforms the FastSpeech 2 + GST baseline, with significant improvements in similarity and style expression.
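The Prosody Distributor's soft selection can be illustrated with a minimal attention sketch. This is not the authors' implementation; it assumes dot-product attention where phone-level text-encoder outputs act as queries and the prosody exemplars act as keys and values, and all names (`soft_select_prosody`, shapes, dimensions) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_select_prosody(phone_enc, exemplars):
    """Attention-based soft selection of prosody exemplars (illustrative sketch).

    phone_enc: (T, d) phone-level text-encoder outputs, used as queries
    exemplars: (N, d) prosody exemplar vectors, used as keys and values
    Returns a (T, d) prosody feature, one vector per phone.
    """
    d = phone_enc.shape[-1]
    scores = phone_enc @ exemplars.T / np.sqrt(d)  # (T, N) scaled dot products
    weights = softmax(scores, axis=-1)             # soft selection over exemplars
    return weights @ exemplars                     # (T, d) weighted combination

# Toy usage with random features
rng = np.random.default_rng(0)
T, N, d = 5, 8, 16                      # phones, exemplars, feature dim
phones = rng.normal(size=(T, d))
exemplars = rng.normal(size=(N, d))
prosody = soft_select_prosody(phones, exemplars)
augmented = phones + prosody            # aggregate prosody into encoder output
```

Because the selection is a softmax-weighted sum rather than a hard argmax, each phone can blend several exemplars, and the whole operation stays differentiable for end-to-end training.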
Keywords
TTS,Prosody,MIST,Attention,Fewshot