Track2Act: Predicting Point Tracks from Internet Videos enables Diverse Zero-shot Robot Manipulation
CoRR (2024)
Abstract
We seek to learn a generalizable goal-conditioned policy that enables
zero-shot robot manipulation: interacting with unseen objects in novel scenes
without test-time adaptation. While typical approaches rely on a large amount
of demonstration data for such generalization, we propose an approach that
leverages web videos to predict plausible interaction plans and learns a
task-agnostic transformation to obtain robot actions in the real world. Our
framework, Track2Act, predicts tracks of how points in an image should move in
future time-steps based on a goal, and can be trained with diverse videos on
the web including those of humans and robots manipulating everyday objects. We
use these 2D track predictions to infer a sequence of rigid transforms of the
object to be manipulated, and obtain robot end-effector poses that can be
executed in an open-loop manner. We then refine this open-loop plan by
predicting residual actions through a closed-loop policy trained with a few
embodiment-specific demonstrations. We show that this approach of combining
scalably learned track prediction with a residual policy requiring minimal
in-domain robot-specific data enables zero-shot robot manipulation, and present
a wide array of real-world robot manipulation results across unseen tasks,
objects, and scenes. https://homangab.github.io/track2act/
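
The middle step of the pipeline, converting predicted point tracks into a sequence of rigid transforms of the manipulated object, can be illustrated with a standard least-squares (Kabsch/Procrustes) fit between corresponding point sets. The sketch below is a minimal illustration under the assumption that the tracked 2D points have already been lifted to 3D (e.g., via a depth image and camera intrinsics); the function name `fit_rigid_transform` and the use of NumPy are illustrative choices, not the paper's implementation.

```python
import numpy as np


def fit_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src points onto dst points.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Standard Kabsch/Procrustes solution via SVD (illustrative sketch,
    not the authors' code).
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct for a possible reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


if __name__ == "__main__":
    # Synthetic check: recover a known rotation and translation.
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(50, 3))
    theta = 0.3
    R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta),  np.cos(theta), 0.0],
                       [0.0, 0.0, 1.0]])
    t_true = np.array([0.1, -0.2, 0.05])
    R_est, t_est = fit_rigid_transform(pts, pts @ R_true.T + t_true)
    assert np.allclose(R_est, R_true, atol=1e-6)
    assert np.allclose(t_est, t_true, atol=1e-6)
```

Fitting such a transform between the initial tracked points and their predicted positions at each future time-step yields a sequence of rigid motions, which the abstract describes converting into end-effector poses for open-loop execution before the residual policy refines them.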