Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
CVPR 2024
Abstract
Co-speech gestures, if presented in the lively form of videos, can achieve
superior visual effects in human-machine interaction. While previous works
mostly generate structural human skeletons, resulting in the omission of
appearance information, we focus on the direct generation of audio-driven
co-speech gesture videos in this work. There are two main challenges: 1) A
suitable motion feature is needed to describe complex human movements with
crucial appearance information. 2) Gestures and speech exhibit inherent
dependencies and should be temporally aligned even for sequences of arbitrary
length. To
solve these problems, we present a novel motion-decoupled framework to generate
co-speech gesture videos. Specifically, we first introduce a well-designed
nonlinear TPS transformation to obtain latent motion features preserving
essential appearance information. Then a transformer-based diffusion model is
proposed to learn the temporal correlation between gestures and speech and to
perform generation in the latent motion space, followed by an optimal motion
selection module to produce long-term coherent and consistent gesture videos.
For better visual perception, we further design a refinement network focusing
on missing details of certain areas. Extensive experimental results show that
our proposed framework significantly outperforms existing approaches in both
motion and video-related evaluations. Our code, demos, and more resources are
available at https://github.com/thuhcsi/S2G-MDDiffusion.
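To make the core idea concrete, below is a minimal sketch of a transformer-based denoiser for diffusion in a latent motion space, conditioned on frame-aligned audio features. This is not the authors' released implementation: the module names, dimensions, and the simple DDPM-style training objective are illustrative assumptions; see the linked repository for the actual code.

```python
# Minimal sketch (illustrative, not the authors' code) of audio-conditioned
# diffusion in a latent motion space with a transformer denoiser.
# All dimensions and names below are assumptions for demonstration.
import torch
import torch.nn as nn


class AudioConditionedDenoiser(nn.Module):
    """Predicts the noise added to a latent motion sequence, given audio."""

    def __init__(self, motion_dim=128, audio_dim=80, d_model=256,
                 n_heads=4, n_layers=6, max_steps=1000):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.step_emb = nn.Embedding(max_steps, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.motion_out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, audio, t):
        # noisy_motion: (B, T, motion_dim); audio: (B, T, audio_dim); t: (B,)
        # Per-frame audio features are added to the motion tokens, so the
        # transformer sees gestures and speech temporally aligned.
        h = self.motion_in(noisy_motion) + self.audio_in(audio)
        h = h + self.step_emb(t).unsqueeze(1)  # broadcast step over time
        return self.motion_out(self.encoder(h))


def ddpm_training_step(model, motion, audio, alphas_cumprod):
    """One standard DDPM loss step: noise the latents, predict the noise."""
    b = motion.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (b,), device=motion.device)
    noise = torch.randn_like(motion)
    a = alphas_cumprod[t].view(b, 1, 1)
    noisy = a.sqrt() * motion + (1 - a).sqrt() * noise
    pred = model(noisy, audio, t)
    return nn.functional.mse_loss(pred, noise)
```

Because conditioning is added per frame rather than pooled, the same network can in principle denoise sequences of arbitrary length, matching the temporal-alignment requirement stated in the abstract.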