Voice Reenactment with F0 and timing constraints and adversarial learning of conversions

2022 30th European Signal Processing Conference (EUSIPCO), 2022

Abstract
This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original neural-VC architecture is proposed, based on sequence-to-sequence voice conversion (S2S-VC), in which the speech prosody of the source speaker is preserved during conversion. First, the S2S-VC architecture is modified to synchronize the converted speech with the source speech by phonetic duration encoding; second, the decoder is conditioned on the desired sequence of F0 values, and an explicit F0 loss is formulated between the F0 of the source speaker and the F0 of the converted speech. Finally, adversarial learning of conversion is integrated within the S2S-VC architecture to exploit the advantages of both the reconstruction of original speech with ground truth and the conversion of speech with manipulated attributes. An experimental evaluation on the VCTK speech database shows that the speech prosody can be efficiently preserved during conversion, and that the proposed adversarial learning consistently improves the conversion and the naturalness of the reenacted speech.
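The abstract does not give the exact form of the F0 loss, so the following is only a minimal sketch of one plausible formulation. It assumes the two contours are already frame-aligned (which the duration-synchronized conversion would allow), that unvoiced frames are marked with 0.0 Hz, and that the distance is a mean absolute difference in the log-F0 domain. The function name `f0_loss` and all of these choices are illustrative assumptions, not the authors' implementation.

```python
import math


def f0_loss(source_f0, converted_f0):
    """Mean absolute log-F0 difference over frames voiced in both contours.

    Assumptions (not from the paper): contours are in Hz, frame-aligned,
    and unvoiced frames are encoded as 0.0.
    """
    if len(source_f0) != len(converted_f0):
        raise ValueError("contours must be time-aligned and of equal length")
    # Compare only frames where both contours are voiced.
    diffs = [
        abs(math.log(s) - math.log(c))
        for s, c in zip(source_f0, converted_f0)
        if s > 0.0 and c > 0.0
    ]
    # With no jointly voiced frame there is nothing to penalize.
    return sum(diffs) / len(diffs) if diffs else 0.0
```

Under these assumptions, identical contours give a loss of zero, and a converted contour that is one octave above the source on every voiced frame gives a loss of ln 2.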
Keywords
Voice conversion, voice reenactment, prosody preservation