Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
arXiv (2024)
Abstract
Audio language models have recently emerged as a promising approach for
various audio generation tasks, relying on audio tokenizers to encode waveforms
into sequences of discrete symbols. Audio tokenization often forces a
compromise between code bitrate and reconstruction accuracy. When dealing with
low-bitrate audio codes, language models are constrained to process only a
subset of the information embedded in the audio, which in turn restricts their
generative capabilities. To circumvent these issues, we propose encoding audio
as vector sequences in continuous space ℝ^d and autoregressively
generating these sequences using a decoder-only diffusion transformer (ARDiT).
Our findings indicate that ARDiT excels at zero-shot text-to-speech, with
performance that matches or even surpasses state-of-the-art models. The
high-bitrate continuous speech representation enables almost flawless
reconstruction, allowing our model to achieve nearly perfect speech editing.
Our experiments reveal that employing Integral Kullback-Leibler
(IKL) divergence for distillation at each autoregressive step significantly
boosts the perceived quality of the samples. Simultaneously, it condenses the
iterative sampling process of the diffusion model into a single step.
Furthermore, ARDiT can be trained to predict several continuous vectors in one
step, significantly reducing latency during sampling. Impressively, one of our
models can generate 170 ms of 24 kHz speech per evaluation step with
minimal degradation in performance. Audio samples are available at
http://ardit-tts.github.io/.
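
The abstract outlines block-wise autoregressive generation of continuous
vectors, where each block is denoised iteratively by a diffusion transformer
conditioned on the already-generated prefix (or in a single step after IKL
distillation). Below is a minimal sketch of such a sampling loop, assuming a
PyTorch setting; the `model` interface, the linear noise schedule, and all
names (`generate`, `block_size`, `num_steps`) are hypothetical stand-ins, not
the paper's actual sampler.

```python
import torch


@torch.no_grad()
def generate(model, text_emb, num_blocks, block_size=4, dim=256, num_steps=16):
    """Autoregressively generate continuous speech vectors block by block.

    Each outer iteration produces one block of `block_size` vectors in R^dim
    by iterative denoising, conditioned on the text and on all previously
    generated blocks. A distilled model would use num_steps=1.
    """
    generated = torch.empty(0, dim)  # growing sequence of clean vectors
    for _ in range(num_blocks):
        x = torch.randn(block_size, dim)  # start the block from pure noise
        for i in reversed(range(num_steps)):
            t = torch.full((block_size,), (i + 1) / num_steps)
            # Hypothetical interface: predict the clean block from the noisy
            # block, the noise level, text conditioning, and the AR prefix.
            x0_pred = model(x, t, text_emb, prefix=generated)
            if i > 0:
                # Re-noise the estimate to the next, lower noise level
                # (a simple interpolation sampler, used only for illustration).
                t_next = i / num_steps
                x = (1 - t_next) * x0_pred + t_next * torch.randn_like(x)
            else:
                x = x0_pred
        generated = torch.cat([generated, x], dim=0)
    return generated  # shape: (num_blocks * block_size, dim)


# Usage with a stand-in model (a trained ARDiT would go here):
dummy = lambda x, t, cond, prefix: 0.5 * x
print(generate(dummy, text_emb=None, num_blocks=3).shape)  # (12, 256)
```

Generating several vectors per block is what amortizes the per-step cost:
larger blocks mean fewer model evaluations per second of audio, which,
combined with single-step distillation, is how the abstract's figure of
170 ms of speech per evaluation step becomes attainable.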