Automated Child Voice Generation: Methodology and Implementation

2023 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), 2023

Abstract
Significant progress has been made in text-to-speech (TTS) modelling, yet synthesizing child speech remains challenging. Research on the topic is limited by the scarcity of child speech datasets and the inherent difficulty of constructing them: children's speech is often less clear and varies considerably in volume, pitch, and rhythm. In this study, we explore three vocoders for synthesizing conversational multi-speaker child speech: the WORLD vocoder, based on statistical parametric speech synthesis (SPSS), and two neural vocoders, Parallel WaveGAN and AutoVocoder. We first trained the AutoVocoder on a dataset of adult female speech, then investigated fine-tuning and adapting these vocoders to capture the distinctive characteristics of child speech while mitigating the need for extensive child speech data. Experimental results showed that the AutoVocoder outperformed the other vocoders in clarity when synthesizing conversational multi-speaker child speech. Despite the challenges posed by the MyST child dataset used in this study, which contains non-phonetic noise and indiscernible speech, the AutoVocoder markedly improved quality and clarity over the ground-truth recordings. Both objective and subjective evaluations indicated that the original speech and the speech synthesized by the AutoVocoder were very similar.
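The adaptation strategy described above (pretrain on plentiful adult speech, then fine-tune on a small amount of child speech with a reduced learning rate) can be sketched in miniature. The following is purely illustrative and makes no claim about the paper's actual training code: a one-dimensional linear model stands in for the vocoder, synthetic number pairs stand in for the adult and child corpora, and all names (`sgd_fit`, `adult`, `child`) are hypothetical.

```python
# Illustrative sketch of the two-stage adaptation recipe: pretrain on a large
# "adult" set, then fine-tune on a tiny "child" set with a smaller learning
# rate so the pretrained parameters shift only slightly.

def sgd_fit(w, b, data, lr, epochs):
    """Plain SGD on squared error for the model y ~ w*x + b."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# Stage 1: "pretrain" on abundant synthetic adult data drawn from y = 2x + 1.
adult = [(k / 10.0, 2 * (k / 10.0) + 1) for k in range(10)]
w, b = sgd_fit(0.0, 0.0, adult, lr=0.05, epochs=200)

# Stage 2: "fine-tune" on three child samples from a shifted target
# (y = 2x + 1.5), using a 5x smaller learning rate: the model adapts
# toward the new offset without discarding what it learned in stage 1.
child = [(0.2, 1.9), (0.5, 2.5), (0.8, 3.1)]
w, b = sgd_fit(w, b, child, lr=0.01, epochs=50)
```

The same idea scales up to neural vocoders: initialize from the adult-trained checkpoint and continue training on the small child set with a lowered learning rate, rather than training from scratch on scarce child data.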
Key words
Text-to-speech, Child TTS system, Parallel WaveGAN, SPSS WORLD vocoder, AutoVocoder