Hierarchical RNNs for Waveform-Level Speech Synthesis.

SLT (2018)

Abstract
Speech synthesis technology has a wide range of applications, such as voice assistants. In recent years, waveform-level synthesis systems have achieved state-of-the-art performance, as they overcome the limitations of vocoder-based synthesis systems. A range of waveform-level synthesis systems have been proposed; this paper investigates the performance of hierarchical Recurrent Neural Networks (RNNs) for speech synthesis. First, the form of network conditioning is discussed, comparing linguistic features and vocoder features from a vocoder-based synthesis system. It is found that, compared with linguistic features, conditioning on vocoder features requires less data and modeling power, and yields better performance when data is limited. By conditioning the hierarchical RNN on vocoder features, this paper develops a neural vocoder, which is capable of high-quality synthesis when there is sufficient data. Furthermore, this neural vocoder is flexible, as conceptually it can map any sequence of vocoder features to speech, enabling efficient synthesizer porting to a target speaker. Subjective listening tests demonstrate that the neural vocoder outperforms a high-quality baseline, and that it can change its voice to a very different speaker, given less than 15 minutes of data for fine-tuning.
Keywords
Vocoders, Speech synthesis, Mathematical model, Linguistics, Feature extraction, Data models, History