Gfl-TTS: A text-to-speech model that combines new tonal prediction and alignment

Yihao Wang, Junhui Niu, Xuegang Deng, Hao Li

2024 4th International Conference on Neural Networks, Information and Communication (NNICE) (2024)

Abstract
Text-to-speech (TTS) models have attracted considerable attention and can generate human-like speech, yet there remains ample room to improve the naturalness of their output and the efficiency of the alignment between text and speech. This paper introduces Gfl-TTS, an acoustic model built on the Grad-TTS backbone that incorporates prosody prediction and a novel alignment module. During training, the model predicts pitch contours, uniformly raises or lowers the F0 information, and uses the alignment module to enhance the prosody of the generated audio. Experiments on the LJ Speech dataset show that Gfl-TTS achieves higher Mean Opinion Score (MOS) ratings than Grad-TTS. Gfl-TTS also runs inference faster than both Tacotron2 and Grad-TTS.
Keywords
TTS, prosody prediction, non-autoregressive alignment, Grad-TTS
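
The abstract mentions uniformly raising or lowering F0 information. As a rough illustration only, the sketch below shifts every voiced frame of a pitch contour by the same number of semitones. The function name `shift_f0`, the semitone parameterization, and the convention that unvoiced frames are stored as 0.0 are all assumptions made for this example; the abstract does not specify how Gfl-TTS performs the shift.

```python
def shift_f0(f0_hz, semitones):
    """Shift all voiced frames of an F0 contour by the same interval.

    f0_hz: list of per-frame F0 values in Hz; 0.0 marks an unvoiced frame
           (an assumed convention, not taken from the paper).
    semitones: positive raises the pitch, negative lowers it.
    """
    # A shift of n semitones multiplies frequency by 2^(n/12).
    factor = 2.0 ** (semitones / 12.0)
    # Unvoiced frames stay at 0.0 so the voicing pattern is preserved.
    return [f * factor if f > 0.0 else 0.0 for f in f0_hz]


contour = [0.0, 220.0, 230.0, 0.0, 210.0]
raised = shift_f0(contour, 2.0)    # whole contour two semitones higher
lowered = shift_f0(contour, -2.0)  # whole contour two semitones lower
```

Because the same multiplicative factor is applied to every voiced frame, the shape of the contour on a log-frequency scale is unchanged; only its overall register moves.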