Vision-language navigation: a survey and taxonomy

Neural Computing and Applications (2024)

Abstract
Vision-language navigation (VLN) tasks require an agent to follow language instructions from a human guide to navigate in previously unseen environments using visual observations. This challenging field, which involves problems in natural language processing (NLP), computer vision (CV), robotics, and related areas, has spawned many excellent works on various VLN tasks. This paper provides a comprehensive survey and an insightful taxonomy of these tasks based on the characteristics of their language instructions. Depending on whether navigation instructions are given once or multiple times, we divide the tasks into two categories, i.e., single-turn and multi-turn tasks. We subdivide single-turn tasks into goal-oriented and route-oriented tasks based on whether the instructions designate a single goal location or specify a sequence of multiple locations. We subdivide multi-turn tasks into interactive and passive tasks based on whether the agent is allowed to ask questions. These tasks require different agent capabilities and entail different model designs. We identify the progress made on these tasks and examine the limitations of existing VLN models and task settings. We hope that a well-designed taxonomy of the task family enables comparisons among approaches across papers addressing the same tasks and clarifies the advances made on them. Furthermore, we discuss several open issues in this field and some promising directions for future research, including the incorporation of knowledge into VLN models and their transfer to the real physical world.
Key words
Vision-language navigation, Taxonomy, Multimodal, Neural networks