Objective evaluation of the DNN-based dialog speech synthesizer with dimensional control of emotion ∗

Masaki Yokoyama,Tomohiro Nagata,Hiroki Mori

semanticscholar(2019)

Abstract
To bring human-machine communication closer to communication between humans, speech synthesis should be able to express speakers' emotions and attitudes as paralinguistic information. Most studies on emotional speech synthesis are based on basic emotion theory [1]. However, human emotions are not simple enough to be explained by basic emotions alone. An alternative is to describe emotion along dimensions [2], which allows a more detailed description than emotion categories. Most speech corpora used in speech synthesis research are read-style [3]. However, conversational speech differs from read speech in that it conveys the speaker's emotion and attitude. For this reason, we believe that read speech corpora are inadequate for reproducing human communication by speech synthesis, which leads us to investigate speech synthesis using a natural dialog speech corpus. Previously, we studied dialog speech synthesis based on the multiple-regression hidden semi-Markov model (MRHSMM) [4]. Although the MRHSMM made it possible to control paralinguistic information in the form of dimensions such as pleasant-unpleasant and aroused-sleepy, the synthesized speech tended to have extreme parameters because of poorly estimated regression matrices. We have shown that MAP estimation of the regression matrices is effective in reducing this overfitting problem [4]. However, the problem remains for certain combinations of input paralinguistic information. In recent years, meanwhile, neural networks (NNs) have replaced HMMs for modeling context-dependent acoustic parameters in speech synthesis [5]. Incorporating a neural network is expected to improve the quality of synthetic speech as well as the controllability of paralinguistic information. In this paper, we propose a method for controlling paralinguistic information in neural network-based dialog speech synthesis. 
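We then evaluated the synthetic speech objectively.

∗ Original Japanese title, translated: "Objective evaluation of DNN dialog speech synthesis with dimensional control of emotion," by Masaki Yokoyama, Tomohiro Nagata, and Hiroki Mori (Utsunomiya University).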
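
The overfitting issue mentioned in the abstract can be illustrated with a minimal sketch. In multiple-regression models of this kind, the mean of an output distribution is commonly written as a linear function of the control vector, mu = H [1, x], where x holds the emotion dimensions. Everything below is an assumption made for illustration and not the authors' implementation: the function names, the two-dimensional control vector, the toy data, and the use of a ridge penalty as a simple stand-in for the MAP estimate discussed in [4].

```python
import numpy as np

def fit_regression_matrix(X, Y, prior_precision=0.0):
    """Estimate H in mu = H @ [1, x] by (optionally regularized) least squares.

    X: (n, d) control vectors, e.g. pleasantness and arousal (hypothetical).
    Y: (n, k) acoustic targets (hypothetical).
    prior_precision = 0 gives the maximum-likelihood fit; a positive value
    adds a ridge penalty on the slope columns, which corresponds to a
    zero-mean Gaussian prior on those columns (an assumed, simplified prior).
    """
    n, d = X.shape
    Xi = np.hstack([np.ones((n, 1)), X])            # augmented inputs [1, x]
    # Penalize only the slopes, not the bias column.
    penalty = prior_precision * np.diag([0.0] + [1.0] * d)
    A = Xi.T @ Xi + penalty                          # (d+1, d+1)
    return np.linalg.solve(A, Xi.T @ Y).T            # H has shape (k, d+1)

def predict_mean(H, x):
    """Return the mean vector mu = H @ [1, x] for one control vector x."""
    xi = np.concatenate([[1.0], np.asarray(x, dtype=float)])
    return H @ xi

# Toy data: few noisy samples whose control values cover a narrow range.
rng = np.random.default_rng(0)
X_train = rng.uniform(-0.3, 0.3, size=(8, 2))
Y_train = rng.normal(size=(8, 3))

H_ml = fit_regression_matrix(X_train, Y_train)
H_map = fit_regression_matrix(X_train, Y_train, prior_precision=1.0)

# Extrapolating to an input far outside the training range.
x_extreme = [2.0, -2.0]
print("ML  mean:", predict_mean(H_ml, x_extreme))
print("MAP mean:", predict_mean(H_map, x_extreme))
```

With few samples, the unregularized slopes can be large, so inputs far from the training data are mapped to extreme means; the penalty shrinks the slope columns of H and therefore limits how far the prediction moves away from the bias term.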