Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing

CoRR (2024)

Abstract
Automatic dubbing, which generates a corresponding version of the input speech in another language, can be widely utilized in many real-world scenarios, such as video and game localization. In addition to synthesizing the translated scripts, automatic dubbing further transfers the speaking style of the original language to the dubbed speech, giving audiences the impression that the characters are speaking in their native tongue. However, state-of-the-art automatic dubbing systems only model the transfer of duration and speaking rate, disregarding other aspects of speaking style, such as emotion, intonation, and emphasis, which are also crucial to fully understanding the characters and speech. In this paper, we propose a joint multiscale cross-lingual speaking style transfer framework to simultaneously model the bidirectional speaking style transfer between two languages at both the global scale (i.e., utterance level) and the local scale (i.e., word level). The global and local speaking styles in each language are extracted and utilized to predict the global and local speaking styles in the other language, with an encoder-decoder framework for each direction and a shared bidirectional attention mechanism for both directions. A multiscale speaking-style-enhanced FastSpeech 2 is then utilized to synthesize the desired speech with the predicted global and local speaking styles for each language. The experimental results demonstrate the effectiveness of our proposed framework, which outperforms a baseline with only duration transfer in both objective and subjective evaluations.
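The abstract describes a shared bidirectional attention mechanism that aligns word-level style representations across the two languages, with one mechanism serving both transfer directions. As a rough illustration of the general idea (not the paper's actual architecture), the sketch below computes a single similarity matrix between the word-level style embeddings of languages A and B and reuses it for both directions: the A→B direction attends over rows, the B→A direction over columns. The function name, shapes, and scaled dot-product scoring are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_bidirectional_attention(style_a, style_b):
    """Illustrative sketch (not the paper's exact method): one shared
    similarity matrix S between word-level style embeddings of language A
    (shape: len_a x d) and language B (shape: len_b x d) drives both
    transfer directions — A->B normalizes rows of S, B->A rows of S^T.
    """
    d = style_a.shape[-1]
    scores = style_a @ style_b.T / np.sqrt(d)        # (len_a, len_b), shared
    a_to_b = softmax(scores, axis=1) @ style_b       # B-side context per A word
    b_to_a = softmax(scores.T, axis=1) @ style_a     # A-side context per B word
    return a_to_b, b_to_a
```

Sharing one score matrix across directions is what makes the attention "bidirectional": the cross-lingual word alignment is estimated once and exploited symmetrically, rather than learned separately per direction.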
Keywords
Automatic dubbing, cross-lingual speaking style transfer, multiscale speaking style transfer, bidirectional attention mechanism, text-to-speech synthesis