Light Supervised Data Selection, Voice Quality Normalized Training and Log Domain Pulse Synthesis

Semantic Scholar (2017)

Abstract
Training acoustic models on expressive speech, and synthesising it, is a challenge for Text-to-Speech (TTS) systems. The 2017 Blizzard Challenge offers an opportunity to tackle this problem by releasing data from “lively” recordings of children's books. This paper describes the System J submission to the Blizzard Challenge 2017 Task EH1. Three potential approaches to handling expressive speech within a DNN-based system are discussed. First, mistranscribed and outlier content can be removed from the training data by using lightly supervised training approaches. Second, paralinguistic information that cannot be predicted from the contextual labels is handled by marginalising out these aspects when training the acoustic model, which should reduce the implicit averaging effect that normally occurs. Finally, the system makes use of a new vocoder that has the potential to be more flexible than other state-of-the-art solutions. Results of the Challenge show that, even though intelligibility and pauses are of reasonable quality and an internal test shows improvements with the new vocoder, the marginalisation over voice quality removed most of the intonation and expressivity, degrading the overall impression more than expected.
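The abstract does not say how the lightly supervised selection was carried out. As an illustration only, one common form of the idea is to decode each recording with a speech recogniser, compare the hypothesis with the supplied transcript, and discard utterances that disagree too much. The sketch below assumes that form; the dictionary keys `transcript` and `asr_hypothesis`, the function names, and the `max_wer` threshold of 0.2 are all hypothetical and are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def select_utterances(utterances, max_wer=0.2):
    """Keep utterances whose transcript agrees with the recogniser output.

    `max_wer` is an illustrative threshold (an assumption, not a value from the paper).
    """
    return [
        utt for utt in utterances
        if word_error_rate(utt["transcript"], utt["asr_hypothesis"]) <= max_wer
    ]


if __name__ == "__main__":
    # Toy data: the second transcript clearly disagrees with the recogniser.
    corpus = [
        {"id": "a", "transcript": "the cat sat on the mat",
         "asr_hypothesis": "the cat sat on the mat"},
        {"id": "b", "transcript": "once upon a time",
         "asr_hypothesis": "one sip on a dime"},
    ]
    print([utt["id"] for utt in select_utterances(corpus)])
```

A threshold on the word error rate trades data quantity against transcript reliability: a stricter value removes more mistranscribed or outlier utterances but leaves less expressive material to train on.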