Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Abstract
Collecting high-quality studio recordings of audio is challenging, which
limits the language coverage of text-to-speech (TTS) systems. This paper
proposes a framework for scaling a multilingual TTS model to 100+ languages
using found data without supervision. The proposed framework combines
speech-text encoder pretraining with unsupervised training using untranscribed
speech and unspoken text data sources, thereby leveraging massively
multilingual joint speech and text representation learning. Without any
transcribed speech in a new language, this TTS model can generate intelligible
speech in >30 unseen languages (CER difference of <10% from ground truth).
With just 15 minutes of transcribed, found data, we can reduce the
intelligibility difference to 1% or less from the ground truth, and achieve
naturalness scores that match the ground truth in several languages.
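The core idea of the framework, as described above, is to embed speech and text into a shared representation space so that untranscribed speech and unspoken text can each train their own branch while paired data aligns the two modalities. The following is a minimal sketch of that pattern; all function names, weights, and inputs are hypothetical toy values chosen for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch: modality-specific projections feed a shared encoder,
# and a consistency loss on paired examples pulls speech and text
# representations together (the "joint" in joint speech-text learning).

def project(vec, weights):
    """Linear projection: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def shared_encode(vec):
    """Shared encoder layer (a fixed ReLU here, purely for illustration)."""
    return [max(0.0, x) for x in vec]

def consistency_loss(speech_repr, text_repr):
    """Squared distance between paired speech and text representations;
    minimizing this aligns the two modalities in the shared space."""
    return sum((s - t) ** 2 for s, t in zip(speech_repr, text_repr))

# Toy inputs: a 3-dim speech feature frame and a 3-dim text embedding.
speech_feat = [0.2, -0.5, 1.0]
text_emb = [0.1, -0.4, 0.9]

# Separate (toy) projection weights per modality; shared encoder on top.
W_speech = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

z_speech = shared_encode(project(speech_feat, W_speech))
z_text = shared_encode(project(text_emb, W_text))
loss = consistency_loss(z_speech, z_text)
print(round(loss, 4))  # → 0.01
```

In a real system the shared encoder would be trained with additional self-supervised objectives on unpaired speech and unpaired text, so a new language needs only untranscribed data to be brought into the same space.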
Keywords
Speech Synthesis, Joint Speech-Text Models, Unsupervised Learning, Multilingual Modeling