Transformer-Based Relational Inference Network for Complex Visual Relational Reasoning

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS(2024)

引用 0|浏览7
暂无评分
摘要
Visual Relational Reasoning is the basis of many vision-and-language based tasks (e.g., visual question answering and referring expression comprehension). In this article, we regard the complex referring expression comprehension (c-REF) task as the reasoning basis, in which c-REF seeks to localise a target object in an image guided by a complex query. Such queries often contain complex logic and thus impose two critical challenges for reasoning: (i) Comprehending the complex queries is difficult since these queries usually refer to multiple objects and their relationships; (ii) Reasoning among multiple objects guided by the queries and then localising the target correctly are non-trivial. To address the above challenges, we propose a Transformer-based Relational Inference Network (Trans-RINet). Specifically, to comprehend the queries, we mimic the language-comprehending mechanism of humans, and devise a language decomposition module to decompose the queries into four types, i.e., basic attributes, absolute location, visual relationship and relative location. We further devise four modules to address the corresponding information. In each module, we consider the intra-(i.e., between the objects) and inter-modality relationships(i.e., between the queries and objects) to improve the reasoning ability. Moreover, we construct a relational graph to represent the objects and their relationships, and devise a multi-step reasoning method to progressively understand the complex logic. Since each type of the queries is closely related, we let each module interact with each other before making a decision. Extensive experiments on the CLEVR-Ref+, Ref-Reasoning, and CLEVR-CoGenT datasets demonstrate the superior reasoning performance of our Trans-RINet.
更多
查看译文
关键词
Visual Relational Reasoning,complex referring expression comprehension,Gated Graph Neural Network
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要