Neural Conversational Speech Synthesis with Flexible Control of Emotion Dimensions

Hiroki Mori, Hironao Nishino

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)(2022)

Abstract
An end-to-end conversational speech synthesis method that enables flexible control of emotional states defined over emotion dimensions is proposed. The Tacotron 2 architecture is extended to receive the emotion dimensions as input. The model is first pre-trained on a large-scale spontaneous speech corpus, then fine-tuned on a natural dialogue speech corpus with manually annotated perceived emotion in the form of pleasantness and arousal. Since the corpus used for pre-training has no emotion information, two pre-training and fine-tuning strategies were examined, and the one that applies an emotion dimension estimator before pre-training was shown to be superior. A subjective evaluation of emotion controllability showed correlations of R = 0.48 for pleasantness and R = 0.78 for arousal between the given and perceived emotional states, indicating the effectiveness of the proposed conversational speech synthesis with emotion control.
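As a rough illustration of the conditioning described above, the following is a hypothetical sketch (not the authors' code or the actual Tacotron 2 implementation) of one plausible way to inject a two-dimensional (pleasantness, arousal) emotion vector into a Tacotron 2-style encoder: project the pair to the encoder dimension and add it to every encoder time step. The function and projection matrix here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_encoder(encoder_states: np.ndarray, emotion: np.ndarray,
                      proj: np.ndarray) -> np.ndarray:
    """Project the (pleasantness, arousal) pair to the encoder dimension
    and add it to every time step (a broadcast over the time axis).
    This is one plausible conditioning scheme, not the paper's exact one."""
    emotion_emb = emotion @ proj          # (2,) @ (2, d) -> (d,)
    return encoder_states + emotion_emb   # broadcast add over T time steps

T, d = 50, 512                            # phoneme steps, encoder dimension
states = rng.standard_normal((T, d))      # stand-in for encoder outputs
proj = rng.standard_normal((2, d)) * 0.01 # learned projection in practice
emotion = np.array([0.6, -0.3])           # pleasant, slightly low arousal
out = condition_encoder(states, emotion, proj)
print(out.shape)                          # -> (50, 512)
```

In a real model the projection would be a learned layer trained jointly with the synthesizer, and the emotion values at inference time would come either from user input or from the estimator used during pre-training.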
Key words
emotion control, emotion controllability, emotion dimension estimator, emotion dimensions, emotion information, emotional state, end-to-end conversational speech synthesis, flexible control, large-scale spontaneous speech corpus, manually annotated perceived emotion, natural dialogue speech corpus, neural conversational speech synthesis, speech synthesis, Tacotron 2 architecture