Affordance-Guided Reinforcement Learning via Visual Prompting
arXiv (2024)
Abstract
Robots equipped with reinforcement learning (RL) have the potential to learn
a wide range of skills solely from a reward signal. However, obtaining a robust
and dense reward signal for general manipulation tasks remains a challenge.
Existing learning-based approaches require significant data, such as
demonstrations or examples of success and failure, to learn task-specific
reward functions. Recently, there is also a growing adoption of large
multi-modal foundation models for robotics. These models can perform visual
reasoning in physical contexts and generate coarse robot motions for various
manipulation tasks. Motivated by this range of capability, in this work, we
propose and study rewards shaped by vision-language models (VLMs).
State-of-the-art VLMs have demonstrated an impressive ability to reason about
affordances through keypoints in a zero-shot manner, and we leverage this to
define dense rewards for robotic learning. On a real-world manipulation task
specified by a natural language description, we find that these rewards improve the sample
efficiency of autonomous RL and enable successful completion of the task in 20K
online finetuning steps. Additionally, we demonstrate the robustness of the
approach to reductions in the number of in-domain demonstrations used for
pretraining, reaching comparable performance in 35K online finetuning steps.
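To make the core idea concrete, below is a minimal sketch of how a VLM-derived affordance keypoint could shape a dense reward: query a VLM with an image and the natural-language task, receive keypoints, and reward proximity of the end effector to the current keypoint of interest. The function `query_vlm_keypoints`, the keypoint names, and the pixel-space distance metric are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def query_vlm_keypoints(image: np.ndarray, task: str) -> dict:
    """Hypothetical stand-in for a VLM call that, given a camera image and a
    natural-language task description, returns 2D affordance keypoints,
    e.g. {"grasp": (u, v), "target": (u, v)} in pixel coordinates."""
    raise NotImplementedError("Replace with a real VLM query.")

def dense_reward(image: np.ndarray, task: str,
                 end_effector_xy: np.ndarray) -> float:
    """Shape a dense reward from a VLM affordance keypoint: the closer the
    end effector is to the keypoint of interest, the higher the reward."""
    keypoints = query_vlm_keypoints(image, task)
    grasp = np.asarray(keypoints["grasp"], dtype=float)
    # Negative pixel distance acts as a simple dense shaping term; the
    # reward is maximal (zero) when the end effector reaches the keypoint.
    return -float(np.linalg.norm(end_effector_xy - grasp))
```

In a full pipeline, the keypoint of interest would presumably change with the task stage (e.g. the grasp point before grasping, the placement target afterward), and distances could be computed in 3D rather than pixel space.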