Progressive Language-Customized Visual Feature Learning for One-Stage Visual Grounding

IEEE TRANSACTIONS ON IMAGE PROCESSING (2022)

Abstract
Visual grounding is the task of localizing an object described by a sentence in an image. Conventional visual grounding methods extract visual and linguistic features in isolation and then perform cross-modal interaction in a post-fusion manner. We argue that this post-fusion mechanism does not fully exploit the information in the two modalities. Instead, it is more desirable to perform cross-modal interaction during the extraction of the visual and linguistic features. In this paper, we propose a language-customized visual feature learning mechanism in which linguistic information guides the extraction of visual features from the very beginning. We instantiate the mechanism as a one-stage framework named Progressive Language-customized Visual feature learning (PLV). Our proposed PLV consists of a Progressive Language-customized Visual Encoder (PLVE) and a grounding module. We customize the visual features with linguistic guidance at each stage of the PLVE through Channel-wise Language-guided Interaction Modules (CLIM). Our proposed PLV outperforms conventional state-of-the-art methods by large margins across five visual grounding datasets without pre-training on object detection datasets, while achieving real-time speed. The source code is available in the supplementary material.
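The abstract does not spell out how a CLIM modulates visual features, so the following is only a minimal PyTorch sketch under one plausible assumption: the sentence embedding predicts a per-channel gate that rescales the feature map of each encoder stage (a FiLM-style channel-wise modulation). All names here (CLIM, gate, the dimensions) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class CLIM(nn.Module):
    """Hypothetical sketch of a Channel-wise Language-guided Interaction
    Module: the pooled sentence embedding predicts one gate per visual
    channel, which rescales the stage's feature map (assumed design)."""

    def __init__(self, visual_channels: int, language_dim: int):
        super().__init__()
        # Project the sentence embedding to one sigmoid gate per channel.
        self.gate = nn.Sequential(
            nn.Linear(language_dim, visual_channels),
            nn.Sigmoid(),
        )

    def forward(self, visual: torch.Tensor, language: torch.Tensor) -> torch.Tensor:
        # visual:   (B, C, H, W) feature map from one encoder stage
        # language: (B, D) pooled sentence embedding
        scale = self.gate(language).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return visual * scale  # channel-wise language customization

# Placing one CLIM after each backbone stage yields the progressive,
# language-customized encoder the abstract describes.
x = torch.randn(2, 256, 32, 32)   # stage feature map
q = torch.randn(2, 768)           # sentence embedding (e.g., from a text encoder)
y = CLIM(256, 768)(x, q)          # modulated feature, same shape as x
```

Because the gating happens inside the encoder rather than after it, every later convolution already operates on language-conditioned features, which is the contrast with post-fusion that the abstract draws.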
Key words
Visualization, Feature extraction, Grounding, Linguistics, Task analysis, Detectors, Representation learning, Visual grounding, referring expression comprehension, visual linguistic understanding, cross-modal fusion