Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis
CoRR (2024)
Abstract
Recent language model-based text-to-speech (TTS) frameworks demonstrate
scalability and in-context learning capabilities. However, they suffer from
robustness issues due to the accumulation of errors in speech unit predictions
during autoregressive language modeling. In this paper, we propose a phonetic
enhanced language modeling method to improve the performance of TTS models. We
leverage self-supervised representations that are phonetically rich as the
training target for the autoregressive language model. Subsequently, a
non-autoregressive model is employed to predict discrete acoustic codecs that
contain fine-grained acoustic details. The TTS model focuses solely on
linguistic modeling during autoregressive training, thereby reducing the error
propagation that occurs in non-autoregressive training. Both objective and
subjective evaluations validate the effectiveness of our proposed method.
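
The two-stage flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy token rules, the end-of-sequence id, and the number of codebooks are all hypothetical stand-ins chosen only to show the control flow. Stage 1 generates phonetic tokens autoregressively (each step conditions on the tokens already produced), and stage 2 maps every phonetic token to acoustic codec codes in one non-autoregressive pass, with no feedback between frames.

```python
from typing import Callable, List

EOS = -1  # hypothetical end-of-sequence id for the toy model


def autoregressive_decode(
    text_ids: List[int],
    next_token: Callable[[List[int], List[int]], int],
    max_len: int,
    eos_id: int,
) -> List[int]:
    """Generate phonetic tokens one at a time; each step sees all prior outputs."""
    out: List[int] = []
    for _ in range(max_len):
        tok = next_token(text_ids, out)
        if tok == eos_id:
            break
        out.append(tok)
    return out


def non_autoregressive_decode(
    phonetic_tokens: List[int],
    predict_codes: Callable[[int], List[int]],
) -> List[List[int]]:
    """Predict codec codes for every frame independently of the other frames."""
    return [predict_codes(tok) for tok in phonetic_tokens]


def toy_next_token(text_ids: List[int], prefix: List[int]) -> int:
    """Toy stand-in for a language model: two frames per input symbol, then EOS."""
    i = len(prefix)
    if i >= 2 * len(text_ids):
        return EOS
    return text_ids[i // 2] * 10 + (i % 2)


def toy_predict_codes(tok: int, num_codebooks: int = 3) -> List[int]:
    """Toy stand-in for a codec predictor: one code per (assumed) codebook."""
    return [(tok + k) % 1024 for k in range(num_codebooks)]


def synthesize(text_ids: List[int]):
    """Run stage 1 (autoregressive) and then stage 2 (non-autoregressive)."""
    phonetic = autoregressive_decode(text_ids, toy_next_token, max_len=100, eos_id=EOS)
    acoustic = non_autoregressive_decode(phonetic, toy_predict_codes)
    return phonetic, acoustic
```

In this sketch only the first stage has a feedback loop, so only the phonetic token sequence can accumulate step-by-step prediction errors; the fine-grained acoustic codes are filled in afterwards in a single pass.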