A Bottleneck Auto-Encoder for F0 Transformations on Speech and Singing Voice

Information (2022)

Abstract
In this publication, we present a deep learning-based method to transform the f0 in speech and singing voice recordings. The f0 transformation is performed by training an auto-encoder on the voice signal's mel-spectrogram and conditioning the auto-encoder on the f0. Inspired by AutoVC/F0, we apply an information bottleneck to the auto-encoder to disentangle the f0 from its latent code. The resulting model successfully applies the desired f0 to the input mel-spectrograms and adapts the speaker identity when necessary, e.g., if the requested f0 falls outside the range of the source speaker/singer. Using the mean f0 error in the transformed mel-spectrograms, we define a disentanglement measure and perform a study over the required bottleneck size. The study reveals that, to remove the f0 from the auto-encoder's latent code, the bottleneck size should be smaller than four for singing and smaller than nine for speech. Through a perceptual test, we compare the audio quality of the proposed auto-encoder to f0 transformations obtained with a classical vocoder. The perceptual test confirms that the audio quality is better for the auto-encoder than for the classical vocoder. Finally, a visual analysis of the latent code for the two-dimensional case is carried out. We observe that the auto-encoder encodes phonemes as repeated discontinuous temporal gestures within the latent code.
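The abstract does not spell out how the mean f0 error is computed, so the following is only a minimal sketch of one plausible form of such a measure. It assumes that an f0 track has already been estimated from the transformed output by some f0 tracker (not shown), that unvoiced frames are marked with 0 Hz, and that the error is expressed in cents. The function name `mean_f0_error_cents` and all of these conventions are assumptions for illustration, not the paper's definition.

```python
import numpy as np


def mean_f0_error_cents(f0_target_hz, f0_output_hz):
    """Mean absolute f0 error in cents over frames voiced in both tracks.

    Assumption: a value of 0 Hz marks an unvoiced frame in either track.
    """
    target = np.asarray(f0_target_hz, dtype=float)
    output = np.asarray(f0_output_hz, dtype=float)
    if target.shape != output.shape:
        raise ValueError("f0 tracks must have the same shape")

    # Only compare frames where both tracks carry a pitch value.
    voiced = (target > 0) & (output > 0)
    if not voiced.any():
        return float("nan")

    # 1200 cents per octave; log2 of the frequency ratio gives octaves.
    cents = 1200.0 * np.log2(output[voiced] / target[voiced])
    return float(np.mean(np.abs(cents)))


# Hypothetical tracks: one frame is an octave too high, one frame is unvoiced.
requested = [220.0, 220.0, 0.0, 440.0]
measured = [220.0, 440.0, 0.0, 440.0]
print(mean_f0_error_cents(requested, measured))
```

Under these assumptions, a low error would indicate that the output follows the requested f0 rather than f0 information retained in the latent code, which is the sense in which the abstract uses the error as a disentanglement measure.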
Keywords
convolutional neural networks, attribute transformation, f0 transformation, voice conversion, auto-encoder