Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval
CoRR (2024)
Abstract
Voice conversion refers to transferring speaker identity while preserving
content. Better disentanglement of speech representations leads to better voice
conversion. Recent studies have found that phonetic information extracted from
the input audio can represent content well. In addition, modeling speaker style
with pre-trained models makes the process more complex.
To tackle these issues, we introduce a new method named "CTVC", which utilizes
disentangled speech representations with contrastive learning and
time-invariant retrieval. Specifically, a similarity-based compression module
is used to build a closer connection between the frame-level hidden
features and phoneme-level linguistic information. Additionally, a
time-invariant retrieval is proposed for timbre extraction based on multiple
segmentations and mutual information. Experimental results demonstrate that
"CTVC" outperforms previous methods and improves both the sound quality and
the similarity of the converted speech.
Keywords
Voice Conversion, Speech Synthesis, Time-Invariant Retrieval, Contrastive Learning