Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
arXiv (2024)
Abstract
Visual relationship detection aims to identify objects and their
relationships in images. Prior methods approach this task by adding separate
relationship modules or decoders to existing object detection architectures.
This separation increases complexity and hinders end-to-end training, which
limits performance. We propose a simple and highly efficient decoder-free
architecture for open-vocabulary visual relationship detection. Our model
consists of a Transformer-based image encoder that represents objects as tokens
and models their relationships implicitly. To extract relationship information,
we introduce an attention mechanism that selects object pairs likely to form a
relationship. We provide a single-stage recipe to train this model on a mixture
of object and relationship detection data. Our approach achieves
state-of-the-art relationship detection performance on Visual Genome and on the
large-vocabulary GQA benchmark at real-time inference speeds. We provide
analyses of zero-shot performance, ablations, and real-world qualitative
examples.
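
The pair-selection idea in the abstract (scoring object tokens against each other and keeping the pairs most likely to form a relationship) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `select_top_pairs`, the projection matrices `w_subject` and `w_object`, the scaled dot-product score, and the top-k cutoff are all assumptions made here for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def select_top_pairs(tokens, w_subject, w_object, k):
    """Score every ordered (subject, object) token pair and keep the k best.

    tokens:    list of n object-token embeddings, each of length d
    w_subject: hypothetical (d x d) projection producing subject queries
    w_object:  hypothetical (d x d) projection producing object keys
    Returns a list of (score, subject_index, object_index), best first.
    """
    d = len(tokens[0])
    queries = [matvec(w_subject, t) for t in tokens]
    keys = [matvec(w_object, t) for t in tokens]
    scored = []
    for i, q in enumerate(queries):
        for j, key in enumerate(keys):
            # Scaled dot-product score, as in standard attention (assumed here).
            # Self-pairs (i == j) are kept; this sketch does not treat them specially.
            score = sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
            scored.append((score, i, j))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    random.seed(0)
    n, d = 5, 4  # 5 object tokens of dimension 4 (toy sizes)
    tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    w_s = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    w_o = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    for score, subj, obj in select_top_pairs(tokens, w_s, w_o, k=3):
        print(f"subject token {subj} -> object token {obj}: score {score:.3f}")
```

Scoring all n × n pairs and keeping only a fixed number keeps the downstream relationship classification cost bounded regardless of how many object tokens the encoder produces; the choice of k here is arbitrary.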