MELS-TTS: Multi-Emotion Multi-Lingual Multi-Speaker Text-to-Speech System via Disentangled Style Tokens

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Abstract
This paper proposes a multi-emotion, multi-lingual, and multi-speaker text-to-speech (MELS-TTS) system, employing disentangled style tokens for effective emotion transfer. In speech encompassing various attributes, such as emotional state, speaker identity, and linguistic style, disentangling these elements is crucial for an efficient multi-emotion, multi-lingual, and multi-speaker TTS system. To accomplish this purpose, we propose to utilize separate style tokens to disentangle emotion, language, speaker, and residual information, inspired by the global style tokens (GSTs). Through the attention mechanism, each style token learns its respective speech attribute from the target speech. Our proposed approach yields improved performance in both objective and subjective evaluations, demonstrating the ability to generate cross-lingual speech with diverse emotions, even from a neutral source speaker, while preserving the speaker’s identity.
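The core mechanism described above, separate style-token banks for emotion, language, speaker, and residual information, each queried via attention by a reference embedding, can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the bank names, token counts, and dimensions are placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, tokens):
    """Dot-product attention of a reference embedding over one token bank.

    query:  (1, dim)       embedding of the target speech
    tokens: (n_tokens, dim) learnable style tokens for one attribute
    Returns a (1, dim) weighted combination of the bank's tokens.
    """
    weights = softmax(query @ tokens.T)   # (1, n_tokens), sums to 1
    return weights @ tokens               # (1, dim)

rng = np.random.default_rng(0)
dim = 8
# Hypothetical disentangled banks, one per speech attribute (as in the paper's
# extension of global style tokens); sizes here are illustrative only.
banks = {name: rng.standard_normal((4, dim))
         for name in ("emotion", "language", "speaker", "residual")}

ref = rng.standard_normal((1, dim))       # stand-in for a reference-encoder output
# Each bank contributes its own attribute embedding; combining them yields
# the style conditioning for the synthesizer.
style = sum(attend(ref, bank) for bank in banks.values())
print(style.shape)
```

Because each attribute has its own bank, an attribute embedding (e.g. emotion) can be swapped at synthesis time while the speaker and language embeddings are kept fixed, which is what enables cross-lingual emotion transfer from a neutral source speaker.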
Keywords
Speech synthesis, emotional speech synthesis, emotion transfer, cross-lingual speech synthesis