Vision-Language Models Provide Promptable Representations for Reinforcement Learning
CoRR (2024)
Abstract
Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general, indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that are grounded in visual observations and encode semantic features drawn from the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and on robot navigation in Habitat. We find that policies trained on embeddings extracted from general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find that our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings.
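
Below is a minimal sketch of the idea the abstract describes: a frozen VLM turns an image observation plus a task-context prompt into an embedding, and a small trainable head maps that embedding to action logits for the RL algorithm. The vlm_encode(image, prompt) callable, the head sizes, and the class name are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class PromptedVLMPolicy(nn.Module):
    # Sketch: RL policy head on top of prompt-conditioned VLM embeddings.
    # `vlm_encode(image, prompt)` is a hypothetical wrapper around a frozen
    # VLM that returns a (batch, embed_dim) tensor; it is an assumption here.
    def __init__(self, vlm_encode, embed_dim, num_actions):
        super().__init__()
        self.vlm_encode = vlm_encode          # frozen, prompt-conditioned encoder
        self.policy_head = nn.Sequential(     # small head trained with RL
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, image, prompt):
        with torch.no_grad():                 # the VLM itself is not updated
            features = self.vlm_encode(image, prompt)  # promptable representation
        return self.policy_head(features)     # action logits for the RL algorithm

In this sketch, the prompt carries the task context and any auxiliary information, so the same frozen VLM can yield different, task-relevant features for different tasks while only the small policy head is trained with RL.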