Cross-lingual Text-to-Speech with Prosody Embedding.

IWSSIP (2023)

Abstract
The paper addresses multilingual text-to-speech, in particular the ability to synthesize speech when the desired combination of properties (speaker, language, speaking style) is missing from the training corpus. The proposed model achieves cross-lingual speech synthesis through neural network embeddings, applied not only to speaker and speaking style IDs but also to context-dependent phonemes and a range of prosodic events, including accents and phrase breaks. This allows the model to efficiently capture relationships between phones and prosodic events in different languages, and consequently to synthesize speech in the voice of a person who has never spoken the target language or used the target style. The model was trained on speech corpora of American English and Serbo-Croatian. A range of experiments, including subjective evaluation of synthesis, was carried out to establish both the quality of synthesis in different scenarios and under different conditions and the similarity of speaker voices between the cross-lingual and original-language scenarios.
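
The embedding idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's architecture: the symbol inventories, the embedding sizes, the random (untrained) vectors, the single phone table shared across languages, and the helper names `make_table` and `encode`. It only shows how per-phone inputs might be assembled by concatenating phoneme, accent, and phrase-break embeddings with utterance-level speaker and style embeddings, so that any speaker can be paired with any phone sequence.

```python
import random

random.seed(0)

def make_table(symbols, dim):
    """Hypothetical lookup table: one vector per symbol.

    Vectors are random here; in a real model they would be learned.
    """
    return {s: [random.uniform(-1.0, 1.0) for _ in range(dim)] for s in symbols}

# All inventories and dimensions below are illustrative assumptions.
speaker_emb = make_table(["en_speaker_1", "sr_speaker_1"], 4)
style_emb = make_table(["neutral", "expressive"], 2)
phone_emb = make_table(["a", "k", "s", "t"], 3)  # assumed shared across languages
accent_emb = make_table(["none", "accented"], 2)
break_emb = make_table(["no_break", "phrase_break"], 2)

def encode(phones, accents, breaks, speaker, style):
    """Build one input vector per phone by concatenating all embeddings."""
    frames = []
    for p, a, b in zip(phones, accents, breaks):
        frames.append(
            phone_emb[p] + accent_emb[a] + break_emb[b]
            + speaker_emb[speaker] + style_emb[style]
        )
    return frames

# Pair a speaker with a phone sequence and style chosen independently of it.
features = encode(
    phones=["s", "t", "a"],
    accents=["none", "none", "accented"],
    breaks=["no_break", "no_break", "phrase_break"],
    speaker="en_speaker_1",
    style="expressive",
)
```

Because the speaker and style vectors are looked up independently of the phone and prosody vectors, the same `encode` call works for combinations that never co-occur in the training data, which is the property the abstract relies on.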
Keywords
cross-lingual model, neural networks, prosody embedding, text-to-speech synthesis