Estimating Mutual Information in Prosody Representation for Emotional Prosody Transfer in Speech Synthesis

2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2021

Abstract
An end-to-end prosody transfer system aims to transfer speech prosody from one speaker to another. One major application is the generation of emotional speech in a new speaker's voice. The end-to-end system uses an intermediate representation of prosody, which encompasses both speaker-related and emotion-related information. The present study tackles the problem of estimating the mutual information between emotion-related and speaker-related factors in the prosody representation. A mutual information neural estimator (MINE), which can measure the mutual information between a high-dimensional continuous prosody embedding and a discrete speaker or emotion label, is applied. The experimental results show that: 1) the prosody representation generated by the end-to-end system indeed contains both emotion and speaker information; 2) the mutual information depends on the type of acoustic features input to the reference encoder; 3) normalization of the log F0 feature is very effective in increasing emotion-related information in the prosody representation; 4) adversarial learning can be applied to reduce speaker information in the prosody representation. These results are useful for the further development of optimal and practical emotional prosody transfer systems.
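As a rough illustration of the quantity MINE works with, the sketch below evaluates the Donsker-Varadhan lower bound on mutual information, E_joint[T] - log E_marginals[exp(T)], between a continuous value and a discrete label. It is not the paper's estimator: MINE trains a neural network critic T to maximize this bound, whereas this sketch assumes a toy one-dimensional "embedding" drawn from two unit-variance Gaussians and plugs in the closed-form ideal critic for that toy case. The names `dv_lower_bound`, `ideal_critic`, `zero_critic`, and `MEANS` are hypothetical and chosen only for this example.

```python
import math
import random


def dv_lower_bound(critic, pairs, rng):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginals[exp(T)]."""
    n = len(pairs)
    # First term: average critic score on true (embedding, label) pairs.
    joint_term = sum(critic(z, y) for z, y in pairs) / n
    # Second term: shuffle the labels to break the pairing, which
    # approximates sampling from the product of the marginals.
    labels = [y for _, y in pairs]
    rng.shuffle(labels)
    marginal_term = sum(
        math.exp(critic(z, y)) for (z, _), y in zip(pairs, labels)
    ) / n
    return joint_term - math.log(marginal_term)


def gaussian_logpdf(z, mean):
    """Log density of a unit-variance Gaussian."""
    return -0.5 * (z - mean) ** 2 - 0.5 * math.log(2 * math.pi)


# Assumption: two equally likely classes with toy class means.
MEANS = {0: -1.0, 1: 1.0}


def ideal_critic(z, y):
    """Closed-form critic log p(z|y) - log p(z) for the toy model."""
    log_cond = gaussian_logpdf(z, MEANS[y])
    mixture = 0.5 * sum(math.exp(gaussian_logpdf(z, m)) for m in MEANS.values())
    return log_cond - math.log(mixture)


def zero_critic(z, y):
    """Uninformative critic; the bound it yields is exactly zero."""
    return 0.0


rng = random.Random(0)
pairs = []
for _ in range(5000):
    y = rng.randrange(2)
    pairs.append((rng.gauss(MEANS[y], 1.0), y))

mi_estimate = dv_lower_bound(ideal_critic, pairs, rng)  # in nats
baseline = dv_lower_bound(zero_critic, pairs, rng)
print(f"DV bound with ideal critic: {mi_estimate:.3f} nats")
print(f"DV bound with zero critic:  {baseline:.3f} nats")
```

Because the label here is binary and uniform, the estimate cannot meaningfully exceed log 2 nats, the entropy of the label. In a setting like the one the abstract describes, the closed-form critic would be replaced by a trained network that takes the prosody embedding and a speaker or emotion label as input.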
Keywords
mutual information, emotion transfer, prosody representation