Scene Graph Generation using Depth-based Multimodal Network.

ICME (2023)

Abstract
Scene graph generation (SGG) provides an efficient way to understand scenes. However, it has been plagued by inaccurate classification of relative spatial relationships and incorrect aggregation of feature information from distant objects. In this paper, we innovatively introduce the depth information of objects into SGG and propose a multimodal edge-featured graph attention network (MEGA-Net). MEGA-Net primarily comprises three modules. First, the edge-aware message passing (EMP) module extracts multimodal features and fuses them into edge features in the graph network via a quadrilinear model. The multimodal features consist of depth, visual, spatial, and linguistic features. The depth feature in EMP captures the relative spatial relationships among objects, which prevents tail spatial predicates from being misclassified as head predicates. Second, we propose a depth-based self-supervised graph attention (DSGAT) module to predict the correlation probability between object pairs. By encoding the depth ranking of different object pairs in 2D images, DSGAT learns more accurate directional attention that avoids aggregating information from unrelated neighbors. Third, we introduce a predicate-aware loss (PA-Loss) to alleviate the feature redundancy caused by the extra depth information; it incorporates semantic frequency information that reflects the priority among different types of relationships. Systematic experiments show that our method achieves state-of-the-art performance on two popular datasets, VG and VRD.
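The abstract does not spell out the quadrilinear fusion in EMP, but a common reading of such multilinear models is to project each modality into a shared edge space and combine the projections multiplicatively. The sketch below illustrates that reading with NumPy; all dimensions, the function name `quadrilinear_fuse`, and the elementwise-product fusion rule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature sizes and the shared edge size
# (none of these are specified in the abstract).
d_depth, d_vis, d_spa, d_lang, d_edge = 8, 16, 4, 12, 32

# One learnable projection per modality into the shared edge space;
# random matrices stand in for trained weights here.
W = {
    "depth": rng.standard_normal((d_depth, d_edge)),
    "visual": rng.standard_normal((d_vis, d_edge)),
    "spatial": rng.standard_normal((d_spa, d_edge)),
    "linguistic": rng.standard_normal((d_lang, d_edge)),
}

def quadrilinear_fuse(f_depth, f_vis, f_spa, f_lang):
    """Fuse four modality features into one edge feature: project each
    into the shared space, then take the elementwise product (one common
    instantiation of a quadrilinear model)."""
    return (
        (f_depth @ W["depth"])
        * (f_vis @ W["visual"])
        * (f_spa @ W["spatial"])
        * (f_lang @ W["linguistic"])
    )

edge_feat = quadrilinear_fuse(
    rng.standard_normal(d_depth),
    rng.standard_normal(d_vis),
    rng.standard_normal(d_spa),
    rng.standard_normal(d_lang),
)
print(edge_feat.shape)  # (32,)
```

In practice the product fusion makes each modality act as a gate on the others, so an edge feature is strong only when all four modalities agree; additive fusion would be the obvious alternative design choice.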
Keywords
Scene Graph Generation, Depth Information, Self-Supervised Graph Attention Network