Few-Shot Custom Speech Synthesis with Multi-Angle Fusion

2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP)(2023)

Abstract
In this paper, we propose the TDNN-VITS model: an efficient custom speech synthesis system that can synthesize the speech of arbitrary target speakers, improving speech quality while reducing the amount of adaptation data required. Our model consists of a speaker encoding module, which extracts speaker timbre information, and a TTS module based on the fully end-to-end VITS model, modified to address the problems specific to custom speech synthesis. To further improve speech quality and speaker similarity, we propose a multi-angle fusion speaker embedding approach. With only 10 utterances (about one minute of speech), good results can be achieved after just half an hour of fine-tuning. Experimental results show that our model improves on previous classical models, achieving good speech naturalness and speaker similarity. Ablation experiments also demonstrate the effectiveness of multi-angle fusion.
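The abstract does not specify how the multi-angle embeddings are fused. As a minimal illustrative sketch (not the authors' method), one common pattern is to derive several "views" of the same utterance from frame-level encoder features, here mean pooling, standard-deviation pooling, and a simple attentive pooling, and combine them with softmax-normalized fusion weights; all function names and the 192-dimensional feature size below are hypothetical.

```python
import numpy as np

def pool_mean(frames):
    # Mean pooling over time: one "view" of the utterance
    return frames.mean(axis=0)

def pool_std(frames):
    # Standard-deviation pooling: captures variability, a second view
    return frames.std(axis=0)

def pool_attentive(frames):
    # Simple self-attentive pooling: frames scored against the mean frame
    scores = frames @ frames.mean(axis=0)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return (weights[:, None] * frames).sum(axis=0)

def fuse_views(views, fusion_logits):
    # Softmax-weighted fusion of per-view embeddings into one speaker embedding
    w = np.exp(fusion_logits - np.max(fusion_logits))
    w /= w.sum()
    return sum(wi * v for wi, v in zip(w, views))

# Hypothetical frame-level features from a TDNN-style encoder: (time, dim)
frames = np.random.default_rng(0).normal(size=(100, 192))
views = [pool_mean(frames), pool_std(frames), pool_attentive(frames)]
embedding = fuse_views(views, np.zeros(3))  # equal weights here; learned in practice
print(embedding.shape)  # (192,)
```

In a trained system the fusion logits would be learned jointly with the TTS module rather than fixed to zeros as in this sketch.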
Keywords
adaptive TTS,speaker embedding,text-to-speech,few-shot,adversarial learning