Language is Strong, Vision is Not: A Diagnostic Study of the Limitations of the Embodied Question Answering Task

Semantic Scholar (2022)

Abstract
We examine the limitations of the Embodied Question Answering (EQA) task, the dataset, and the models (Das et al., 2018). We observe that the role of vision in EQA is small, and the models often exploit language biases found in the dataset. We demonstrate that perturbing vision at different levels (incongruent, black, or random-noise images) still allows the models to learn from general visual patterns, suggesting that they capture some common-sense reasoning about the visual world. We argue that a better set of data and models is required to achieve better performance in predicting (generating) correct answers. We make the code used in the experiments available here: [the GitHub link placeholder].
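The vision perturbations mentioned in the abstract (black or random-noise frames fed to the model in place of real observations) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name `perturb_image` and the image shape are assumptions.

```python
import numpy as np

def perturb_image(img, mode, rng=None):
    """Return a perturbed copy of an H x W x C uint8 image.

    Modes mirror two of the ablations described in the abstract:
    - "black": an all-zero image (no visual signal)
    - "noise": uniform random noise in place of the real frame
    The "incongruent" setting would substitute a frame from a
    different episode, which requires a dataset and is omitted here.
    """
    rng = rng or np.random.default_rng(0)
    if mode == "black":
        return np.zeros_like(img)
    if mode == "noise":
        return rng.integers(0, 256, size=img.shape, dtype=img.dtype)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical 224x224 RGB frame standing in for an EQA observation.
frame = np.full((224, 224, 3), 128, dtype=np.uint8)
black = perturb_image(frame, "black")
noise = perturb_image(frame, "noise")
print(black.max())   # 0
print(noise.shape)   # (224, 224, 3)
```

Feeding such frames to a trained EQA model while holding the question fixed is one way to probe how much the answer depends on vision versus language priors.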