A non-linear source-filter based vocoder with prosody control.

NCC (2023)

Abstract
Reconstructing a speech signal from its compact acoustic representation is a challenging task. Even when the acoustic representations produced by speech processing systems (such as text-to-speech synthesis or speech enhancement) are highly accurate, the performance of the vocoder affects the naturalness of the synthesized speech. Conventional vocoders are based on the linear source-filter model of the human speech production mechanism. However, they cannot be incorporated into the training of an end-to-end model, and they are sensitive to errors in the estimated acoustic representations. Neural vocoders such as WaveNet can be incorporated into end-to-end training, but their complexity and inference time are high, and they have no provision for controlling prosody. In this paper, we propose a compact, neural-network-based non-linear vocoder with prosody control, built on the source-filter model of the human speech production mechanism. The prosody of the synthesized speech can be controlled effectively by adjusting prosodic parameters such as the fundamental frequency ($f_0$), without affecting the naturalness of the speech. The model achieves a mean opinion score (MOS) of 4.09, with a much lower real-time factor and model complexity.
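For readers unfamiliar with the conventional linear source-filter model that the abstract refers to, the following is a minimal sketch in Python. It is not the proposed non-linear neural vocoder. It assumes a voiced sound modeled as an impulse train at the fundamental frequency passed through a single two-pole resonator standing in for the vocal tract. The function names, the 700 Hz formant, the 100 Hz bandwidth, and the `f0_scale` parameter are all hypothetical and chosen for illustration only.

```python
import numpy as np

def impulse_train(f0_hz, duration_s, sample_rate):
    """Voiced excitation (the 'source'): one unit impulse per pitch period."""
    n = int(round(duration_s * sample_rate))
    period = int(round(sample_rate / f0_hz))
    source = np.zeros(n)
    source[::period] = 1.0
    return source

def resonator(formant_hz, bandwidth_hz, sample_rate):
    """Coefficients [a1, a2] of a stable two-pole resonator (illustrative vocal-tract filter)."""
    r = np.exp(-np.pi * bandwidth_hz / sample_rate)   # pole radius, always < 1
    theta = 2.0 * np.pi * formant_hz / sample_rate    # pole angle
    return [-2.0 * r * np.cos(theta), r * r]

def all_pole_filter(source, a):
    """Linear 'filter' stage: y[t] = x[t] - sum_k a[k] * y[t - k]."""
    y = np.zeros(len(source))
    for t in range(len(source)):
        acc = source[t]
        for k, ak in enumerate(a, start=1):
            if t - k >= 0:
                acc -= ak * y[t - k]
        y[t] = acc
    return y

def synthesize(f0_hz, f0_scale=1.0, duration_s=0.05, sample_rate=16000):
    """Scaling f0 changes only the source; the filter (timbre) is left untouched."""
    source = impulse_train(f0_hz * f0_scale, duration_s, sample_rate)
    a = resonator(700.0, 100.0, sample_rate)  # assumed formant and bandwidth
    return all_pole_filter(source, a)

original = synthesize(100.0)
raised = synthesize(100.0, f0_scale=2.0)  # same filter, pitch one octave higher
```

Because the source and the filter are separate stages, changing `f0_scale` alters the pitch while the resonance stays the same, which is the property that makes prosody control natural in source-filter designs.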
Keywords
Speech reconstruction, source-filter model, neural vocoder, prosody control, multi-head attention