A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
CoRR (2023)
Abstract
Key to tasks that require reasoning about natural language in visual contexts
is grounding words and phrases to image regions. However, observing this
grounding in contemporary models is complex, even if it is generally expected
to take place if the task is addressed in a way that is conducive to
generalization. We propose a framework to jointly study task performance and
phrase grounding, and propose three benchmarks to study the relation between
the two. Our results show that contemporary models demonstrate inconsistency
between their ability to ground phrases and solve tasks. We show how this can
be addressed through brute-force training on phrase grounding annotations, and
analyze the dynamics it creates. Code is available at
https://github.com/lil-lab/phrase_grounding.
Keywords
phrase grounding, language models, task performance, vision