RoboDreamer: Learning Compositional World Models for Robot Imagination
arXiv (2024)
Abstract
Text-to-video models have demonstrated substantial potential in robotic
decision-making, enabling the imagination of realistic plans of future actions
as well as accurate environment simulation. However, one major issue in such
models is generalization: they are limited to synthesizing videos for
language instructions similar to those seen at training time. This is
heavily limiting in decision-making, where we seek a powerful world model to
synthesize plans of unseen combinations of objects and actions in order to
solve previously unseen tasks in new environments. To resolve this issue, we
introduce RoboDreamer, an innovative approach for learning a compositional
world model by factorizing the video generation. We leverage the natural
compositionality of language to parse instructions into a set of lower-level
primitives, and condition a set of models on these primitives to generate videos. We
illustrate how this factorization naturally enables compositional
generalization, by allowing us to formulate a new natural language instruction
as a combination of previously seen components. We further show how such a
factorization enables us to add multimodal goals, allowing us to
specify a video we wish to generate given both natural language instructions
and a goal image. Our approach can successfully synthesize video plans for
unseen goals in the RT-X dataset, enables successful robot execution in simulation, and
substantially outperforms monolithic baseline approaches to video generation.
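
The factorization described above can be illustrated with a minimal sketch. The abstract does not specify how instructions are parsed or how the per-primitive models are combined, so everything here is an assumption: `split_instruction` is a toy stand-in for a real language parser, `toy_denoiser` is a hypothetical placeholder for a learned video model, and `compose_noise` uses a combination rule common in compositional diffusion work (summing each component's offset from the unconditional estimate), which may differ from the authors' actual method.

```python
from typing import Callable, List, Optional

# Hypothetical interface: a denoiser maps (noisy sample, condition or None)
# to a noise estimate of the same length.
Denoiser = Callable[[List[float], Optional[str]], List[float]]


def split_instruction(instruction: str) -> List[str]:
    """Toy parser (assumption): split an instruction into phrases on ' and '."""
    return [part.strip() for part in instruction.split(" and ") if part.strip()]


def compose_noise(
    denoiser: Denoiser,
    x: List[float],
    components: List[str],
    weight: float = 1.0,
) -> List[float]:
    """Add each component's offset from the unconditional estimate.

    This is one common composition rule, not necessarily the paper's.
    """
    base = denoiser(x, None)  # unconditional estimate
    combined = list(base)
    for component in components:
        cond = denoiser(x, component)  # estimate conditioned on one primitive
        for i in range(len(combined)):
            combined[i] += weight * (cond[i] - base[i])
    return combined


def toy_denoiser(x: List[float], condition: Optional[str]) -> List[float]:
    """Placeholder for a learned model: shifts the input by the condition length."""
    shift = 0.0 if condition is None else float(len(condition))
    return [value + shift for value in x]


if __name__ == "__main__":
    parts = split_instruction("pick up the red block and place it in the drawer")
    print(parts)
    print(compose_noise(toy_denoiser, [0.0, 1.0], parts, weight=0.5))
```

Under these assumptions, a new instruction built from previously seen phrases needs no model trained on the full sentence: each phrase contributes its own conditional term, and the terms are summed at sampling time.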