TSwinPose: Enhanced monocular 3D human pose estimation with JointFlow

Muyu Li, Henan Hu, Jingjing Xiong, Xudong Zhao, Hong Yan

Expert Systems with Applications (2024)

Abstract
Monocular 3D human pose estimation is challenging because of depth ambiguity and partial occlusion. Most recent works treat it as a 2D-to-3D lifting task that takes 2D keypoint sequences as input and exploits their spatial and temporal relationships. However, prior works focus on capturing spatio-temporal correlations and ignore the motion of joints, which is needed for continuous estimation. To extend the potential of 2D-to-3D pose estimation, we propose TSwinPose, which learns multi-scale spatio-temporal representations from 2D keypoint locations and motion patterns. The input 2D keypoint sequences are enhanced by JointFlow, which encodes the motion of each human joint. Building on Swin-Transformer, we design a temporal-domain Swin-Unet structure that models multi-scale spatio-temporal relationships of human joints across different temporal windows. The final 3D pose, generated from multi-stage representations, is temporally consistent and more accurate. Experiments on three benchmark datasets, Human3.6M, MPI-INF-3DHP, and HumanEva-I, demonstrate that TSwinPose performs on par with state-of-the-art methods. Moreover, introducing JointFlow as a plug-in extension improves performance significantly, particularly for long-term 2D-to-3D lifting methods for human pose estimation.
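The abstract does not say how JointFlow encodes joint motion. As a rough illustration of what "enhancing" a 2D keypoint sequence with motion could look like, the sketch below assumes motion is the displacement of each joint from the previous frame and appends it to the joint's coordinates. The function name `add_joint_motion`, the frame-difference definition, and the zero motion assigned to the first frame are all assumptions made for illustration, not the authors' formulation.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y) of one 2D keypoint
Frame = List[Point]           # all joints detected in one video frame
MotionPoint = Tuple[float, float, float, float]  # (x, y, dx, dy)


def add_joint_motion(seq: List[Frame]) -> List[List[MotionPoint]]:
    """Append a per-joint displacement (dx, dy) to every 2D keypoint.

    Hypothetical helper: motion is approximated as the difference to the
    previous frame, and the first frame is assumed to have zero motion.
    """
    out: List[List[MotionPoint]] = []
    for t, frame in enumerate(seq):
        # The first frame has no predecessor, so compare it with itself.
        prev = seq[t - 1] if t > 0 else frame
        if len(prev) != len(frame):
            raise ValueError("every frame must contain the same number of joints")
        out.append([
            (x, y, x - px, y - py)
            for (x, y), (px, py) in zip(frame, prev)
        ])
    return out


# Two frames, two joints: only the first joint moves between frames.
sequence = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(1.0, 2.0), (1.0, 1.0)],
]
enhanced = add_joint_motion(sequence)
```

In this reading, each joint grows from two input channels to four, and a lifting network would consume the enlarged sequence unchanged in length, which is compatible with the abstract's description of JointFlow as a plug-in extension.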
Keywords
Monocular video, 3D human pose estimation, Transformer