Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
CoRR (2024)
Abstract
Visual grounding (VG) aims at locating the foreground entities that match a
given natural language expression. Previous datasets and methods for the
classic VG task mainly rely on the prior assumption that the given expression
literally refers to the target object, which greatly impedes the practical
deployment of agents in real-world scenarios. Since users usually prefer to
provide intention-based expressions for the desired object rather than
covering all of its details, it is necessary for agents to interpret such
intention-driven instructions. Thus, in this work, we take a step further
towards intention-driven visual-language (V-L) understanding. To push classic
VG towards human intention interpretation, we propose a new intention-driven
visual grounding (IVG) task and build the largest-scale IVG dataset, named
IntentionVG, with free-form intention expressions. Considering that practical
agents need to move around and find specific targets among various scenarios
to accomplish the grounding task, our IVG task and IntentionVG dataset take
the crucial properties of both multi-scenario perception and egocentric view
into account. In addition, we set up various types of models as baselines for
our IVG task. Extensive experiments on our IntentionVG dataset and baselines
demonstrate the necessity and efficacy of our method for the V-L field. To
foster future research in this direction, our newly built dataset and
baselines will be made publicly available.