Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Abstract
Collecting high-quality studio recordings of audio is challenging, which
limits the language coverage of text-to-speech (TTS) systems. This paper
proposes a framework for scaling a multilingual TTS model to 100+ languages
using found data without supervision. The proposed framework combines
speech-text encoder pretraining with unsupervised training using untranscribed
speech and unspoken text data sources, thereby leveraging massively
multilingual joint speech and text representation learning. Without any
transcribed speech in a new language, this TTS model can generate intelligible
speech in >30 unseen languages (CER difference of <10% from ground truth).
With just 15 minutes of transcribed, found data, we can reduce the
intelligibility difference to 1% or less from the ground truth, and achieve
naturalness scores that match the ground truth in several languages.
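The core idea of the framework, as described above, is to embed speech and text into a shared representation space so that untranscribed speech and unspoken text can each train their own branch while paired data aligns the two modalities. The following is a minimal sketch of that pattern; all function names, weights, and inputs are hypothetical toy values chosen for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch: modality-specific projections feed a shared encoder,
# and a consistency loss on paired examples pulls speech and text
# representations together (the "joint" in joint speech-text learning).

def project(vec, weights):
    """Linear projection: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def shared_encode(vec):
    """Shared encoder layer (a fixed ReLU here, purely for illustration)."""
    return [max(0.0, x) for x in vec]

def consistency_loss(speech_repr, text_repr):
    """Squared distance between paired speech and text representations;
    minimizing this aligns the two modalities in the shared space."""
    return sum((s - t) ** 2 for s, t in zip(speech_repr, text_repr))

# Toy inputs: a 3-dim speech feature frame and a 3-dim text embedding.
speech_feat = [0.2, -0.5, 1.0]
text_emb = [0.1, -0.4, 0.9]

# Separate (toy) projection weights per modality; shared encoder on top.
W_speech = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

z_speech = shared_encode(project(speech_feat, W_speech))
z_text = shared_encode(project(text_emb, W_text))
loss = consistency_loss(z_speech, z_text)
print(round(loss, 4))  # → 0.01
```

In a real system the shared encoder would be trained with additional self-supervised objectives on unpaired speech and unpaired text, so a new language needs only untranscribed data to be brought into the same space.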
Keywords
Speech Synthesis, Joint Speech-Text Models, Unsupervised Learning, Multilingual Modeling