Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs
arXiv (2024)
Abstract
An elusive goal in navigation research is to build an intelligent agent that
can understand multimodal instructions, including natural language and images,
and perform useful navigation. To achieve this, we study a widely useful
category of navigation tasks we call Multimodal Instruction Navigation with
demonstration Tours (MINT), in which the environment prior is provided through
a previously recorded demonstration video. Recent advances in Vision Language
Models (VLMs) have shown a promising path toward achieving this goal, as they
demonstrate capabilities in perceiving and reasoning about multimodal inputs.
However, VLMs are typically trained to predict textual output, and how to best
utilize them for navigation remains an open research question. To solve MINT,
we present Mobility VLA, a hierarchical Vision-Language-Action (VLA) navigation
policy that combines the environment understanding and common sense reasoning
power of long-context VLMs and a robust low-level navigation policy based on
topological graphs. The high-level policy consists of a long-context VLM that
takes the demonstration tour video and the multimodal user instruction as input
to find the goal frame in the tour video. Next, a low-level policy uses the
goal frame and an offline-constructed topological graph to generate robot
actions at every timestep. We evaluated Mobility VLA in an 836 m² real-world
environment and show that Mobility VLA achieves high end-to-end success rates on
previously unsolved multimodal instructions such as "Where should I return
this?" asked while holding a plastic bin.