Language is Strong, Vision is Not: A Diagnostic Study of the Limitations of the Embodied Question Answering Task

Semantic Scholar (2022)

Abstract
We examine the limitations of the Embodied Question Answering (EQA) task, the dataset, and the models (Das et al., 2018). We observe that the role of vision in EQA is small, and the models often exploit language biases found in the dataset. We demonstrate that perturbing vision at different levels (incongruent, black, or random-noise images) still allows the models to learn from general visual patterns, suggesting that they capture some common-sense reasoning about the visual world. We argue that a better set of data and models is required to achieve better performance in predicting (generating) correct answers. We make the code used in the experiments available here: [the GitHub link placeholder].
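The vision perturbations mentioned in the abstract (black or random-noise frames fed to the model in place of real observations) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name `perturb_image` and the image shape are assumptions.

```python
import numpy as np

def perturb_image(img, mode, rng=None):
    """Return a perturbed copy of an H x W x C uint8 image.

    Modes mirror two of the ablations described in the abstract:
    - "black": an all-zero image (no visual signal)
    - "noise": uniform random noise in place of the real frame
    The "incongruent" setting would substitute a frame from a
    different episode, which requires a dataset and is omitted here.
    """
    rng = rng or np.random.default_rng(0)
    if mode == "black":
        return np.zeros_like(img)
    if mode == "noise":
        return rng.integers(0, 256, size=img.shape, dtype=img.dtype)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical 224x224 RGB frame standing in for an EQA observation.
frame = np.full((224, 224, 3), 128, dtype=np.uint8)
black = perturb_image(frame, "black")
noise = perturb_image(frame, "noise")
print(black.max())   # 0
print(noise.shape)   # (224, 224, 3)
```

Feeding such frames to a trained EQA model while holding the question fixed is one way to probe how much the answer depends on vision versus language priors.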