Hyperbolic Learning with Synthetic Captions for Open-World Detection
CVPR 2024
Abstract
Open-world detection poses significant challenges, as it requires the
detection of any object using either object class labels or free-form texts.
Existing works often rely on large-scale manually annotated caption datasets
for training, which are extremely expensive to collect. Instead, we propose to
transfer knowledge from vision-language models (VLMs) to enrich the
open-vocabulary descriptions automatically. Specifically, we bootstrap dense
synthetic captions using pre-trained VLMs to provide rich descriptions of
different regions in images, and incorporate these captions to train a novel
detector that generalizes to novel concepts. To mitigate the noise caused by
hallucination in synthetic captions, we also propose a novel hyperbolic
vision-language learning approach to impose a hierarchy between visual and
caption embeddings. We call our detector “HyperLearner”. We conduct extensive
experiments on a wide variety of open-world detection benchmarks (COCO, LVIS,
Object Detection in the Wild, RefCOCO) and our results show that our model
consistently outperforms existing state-of-the-art methods, such as GLIP,
GLIPv2 and Grounding DINO, when using the same backbone.
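The abstract mentions imposing a hierarchy between visual and caption embeddings via hyperbolic learning. As an illustrative sketch only (not the paper's actual formulation), hyperbolic spaces such as the Poincaré ball encode hierarchy naturally: general concepts can sit near the origin while specific instances sit near the boundary, and distances grow rapidly toward the boundary. The embedding values below are made-up toy vectors, not from the paper.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the Poincare ball (||x|| < 1).

    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    """
    sq_diff = np.sum((u - v) ** 2)
    denom_u = 1.0 - np.sum(u ** 2)
    denom_v = 1.0 - np.sum(v ** 2)
    return float(np.arccosh(1.0 + 2.0 * sq_diff / (denom_u * denom_v + eps)))

# Toy hierarchy (hypothetical values): a generic caption embedding near the
# origin represents a broad concept; a specific visual-region embedding lies
# closer to the boundary, representing a more specific instance.
caption_emb = np.array([0.1, 0.0])   # generic concept ("an animal")
region_emb = np.array([0.7, 0.1])    # specific region ("a spotted dog")

print(poincare_distance(caption_emb, region_emb))
```

A training objective in this spirit could, for example, penalize pairs whose hyperbolic distance violates the intended general-to-specific ordering; the paper's concrete loss should be consulted for the actual method.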