Vision and Language Navigation using Multi-head Attention Mechanism

Sai Mao, Junmin Wu, Siqi Hong

2020 6th International Conference on Big Data and Information Analytics (BigDIA), 2020

Abstract
In the wake of developments in deep learning, more and more research focuses on the intersection of natural language processing and machine vision; vision and language navigation (VLN) is one such area. The VLN task requires an embodied agent to follow a natural language instruction and navigate inside a real 3D environment with the help of visual information, planning a trajectory from the start point to the goal location. In this paper, inspired by previous work, we introduce a multi-head attention module with a parallel attention computation method, which applies the multi-head attention mechanism to visual and textual input to enhance the performance of the model. Specifically, we first design a multi-head attention module with trainable parameters that extracts associated attention from the textual and visual information; the extracted attention variables help the agent become aware of which parts of the sentence or image are more important. Second, to help the agent perceive more useful information, we perform a parallel computation on the extracted attention and the input features (i.e., the visual and textual input features), then use layer normalization to combine them. Experimental results indicate that our proposed module enables the model to obtain better performance and surpass the baseline model. Our model achieves a success rate of 51% and an oracle success rate of 62% with low navigation error.
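The abstract describes a cross-modal multi-head attention module computed in parallel over visual and textual inputs and combined with those inputs through layer normalization. The sketch below is an illustrative interpretation of that description, not the authors' implementation: the class name VisualTextualAttention, the dimensions d_model and n_heads, and the residual-plus-LayerNorm combination are assumptions.

```python
# Hedged sketch of the parallel visual/textual multi-head attention idea.
# All module names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class VisualTextualAttention(nn.Module):
    """Parallel multi-head attention over visual and textual features,
    combined with the original inputs through layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Text attends to visual features and vice versa, computed in parallel.
        self.text_to_vision = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vision_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(d_model)
        self.norm_vision = nn.LayerNorm(d_model)

    def forward(self, text_feats: torch.Tensor, vis_feats: torch.Tensor):
        # text_feats: (batch, n_tokens, d_model); vis_feats: (batch, n_views, d_model)
        attended_text, _ = self.text_to_vision(text_feats, vis_feats, vis_feats)
        attended_vis, _ = self.vision_to_text(vis_feats, text_feats, text_feats)
        # Combine the attention output with the original input via residual + LayerNorm.
        text_out = self.norm_text(text_feats + attended_text)
        vis_out = self.norm_vision(vis_feats + attended_vis)
        return text_out, vis_out


if __name__ == "__main__":
    module = VisualTextualAttention()
    text = torch.randn(2, 20, 512)    # e.g. 20 instruction tokens
    vision = torch.randn(2, 36, 512)  # e.g. 36 panoramic view features
    t_out, v_out = module(text, vision)
    print(t_out.shape, v_out.shape)   # (2, 20, 512) and (2, 36, 512)
```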
Keywords
Visualization,Navigation,Computational modeling,Feature extraction,Data models,Data mining,Task analysis