Language-guided Residual Graph Attention Network and Data Augmentation for Visual Grounding

ACM Transactions on Multimedia Computing, Communications, and Applications (2024)

Abstract
Visual grounding is an essential task for understanding the semantic relationship between a given text description and the target object in an image. Due to the innate complexity of language and the rich semantic context of images, it remains challenging to infer the underlying relationships and to reason between the objects in an image and the given expression. Although existing visual grounding methods have made promising progress, cross-modal mapping between the two domains is still not well handled, especially when the expressions are long and complex. To address this issue, we propose a language-guided residual graph attention network for visual grounding (LRGAT-VG), which uses residual connections between graph convolution layers to enable deeper stacks of layers, allowing it to handle long and complex expressions better than other graph-based methods. Furthermore, we introduce a Language-Guided Data Augmentation (LGDA) strategy based on copy-paste operations between pairs of source and target images, which increases the diversity of the training data while preserving the relationship between the objects in the image and the expression. In extensive experiments on three visual grounding benchmarks, RefCOCO, RefCOCO+, and RefCOCOg, LRGAT-VG with LGDA achieves performance competitive with state-of-the-art graph-network-based referring expression approaches, demonstrating its effectiveness.
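
The abstract gives no implementation details, but its central architectural idea, wrapping each graph attention layer in a residual connection so that more layers can be stacked, can be illustrated with a minimal sketch. The following is not the authors' code: the class name ResidualGraphAttentionLayer, the scaled dot-product attention form, the feature dimension, and the dense adjacency matrix are all assumptions, and the language-guided conditioning is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualGraphAttentionLayer(nn.Module):
    """One graph attention layer wrapped with a residual (skip) connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) object features; adj: (N, N) adjacency between objects.
        scores = self.query(nodes) @ self.key(nodes).t() / nodes.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))  # attend only to graph neighbours
        attn = F.softmax(scores, dim=-1)
        updated = attn @ self.value(nodes)
        # The residual connection is what allows many such layers to be stacked.
        return self.norm(nodes + updated)


# Stacking several layers, made practical by the residual connections.
layers = nn.ModuleList([ResidualGraphAttentionLayer(256) for _ in range(4)])
nodes = torch.randn(9, 256)              # e.g. features of 9 detected object proposals
adj = (torch.rand(9, 9) > 0.5).float()   # toy adjacency for illustration
adj.fill_diagonal_(1.0)                  # self-loops keep every softmax row finite
for layer in layers:
    nodes = layer(nodes, adj)

Similarly, the abstract describes LGDA only as copy-paste operations on pairs of source and target images that keep the expression valid for the pasted object. One plausible, heavily simplified reading is sketched below; the function name copy_paste_pair, the box format, the assumption that the two images share the same size, and the direct pixel copy are illustrative choices, not details from the paper.

import numpy as np


def copy_paste_pair(source_img, target_img, box):
    """Paste the region box = (x1, y1, x2, y2) of the source image, assumed to
    contain the referred object, onto the target image at the same position, so
    that the source expression stays valid for the augmented image."""
    x1, y1, x2, y2 = box
    augmented = target_img.copy()
    augmented[y1:y2, x1:x2] = source_img[y1:y2, x1:x2]
    return augmented


# Toy usage with two same-sized random "images" and a referred-object box.
src = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
tgt = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
aug = copy_paste_pair(src, tgt, (100, 120, 220, 300))  # the expression stays paired with this box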
Keywords
Residual graph attention network, data augmentation, visual grounding