Open-World Human-Object Interaction Detection via Multi-modal Prompts
CVPR 2024
Abstract
In this paper, we develop MP-HOI, a powerful Multi-modal
Prompt-based HOI detector designed to leverage both textual descriptions for
open-set generalization and visual exemplars for handling high ambiguity in
descriptions, realizing HOI detection in the open world. Specifically, it
integrates visual prompts into existing language-guided-only HOI detectors to
handle situations where textual descriptions face difficulties in
generalization and to address complex scenarios with high interaction
ambiguity. To facilitate MP-HOI training, we build a large-scale HOI dataset
named Magic-HOI, which gathers six existing datasets into a unified label
space, forming over 186K images with 2.4K objects, 1.2K actions, and 20K
interactions. Furthermore, to tackle the long-tail issue within the Magic-HOI
dataset, we introduce an automated pipeline for generating realistically
annotated HOI images and present SynHOI, a high-quality synthetic HOI dataset
containing 100K images. Leveraging these two datasets, MP-HOI optimizes the HOI
task as a similarity learning process between multi-modal prompts and
objects/interactions via a unified contrastive loss, to learn generalizable and
transferable object and interaction representations from large-scale data. MP-HOI
can serve as a generalist HOI detector, surpassing the HOI vocabulary of
existing expert models by more than 30 times. Concurrently, our results
demonstrate that MP-HOI exhibits remarkable zero-shot capability in real-world
scenarios and consistently achieves new state-of-the-art performance across
various benchmarks.
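
The abstract frames HOI detection as similarity learning between prompt embeddings and object/interaction embeddings under a contrastive loss, but does not give the exact formulation. As an illustration only, the following is a minimal sketch of a generic softmax-based (InfoNCE-style) contrastive loss, assuming cosine similarity and a temperature of 0.07. The function names, the temperature value, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(region_embeddings, prompt_embeddings, targets, temperature=0.07):
    """Mean cross-entropy over region-to-prompt similarity scores.

    Hypothetical sketch: each region (an object or interaction embedding)
    is scored against every prompt embedding, and the loss rewards a high
    score for the prompt at index targets[i]. Prompts could be textual or
    visual; the loss treats them identically once they are embedded.
    """
    total = 0.0
    for region, target in zip(region_embeddings, targets):
        logits = [cosine_similarity(region, p) / temperature for p in prompt_embeddings]
        # Numerically stable log-sum-exp over all prompts.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[target]
    return total / len(region_embeddings)


# Toy 2-D embeddings (assumed values, for illustration only).
prompts = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.9, 0.1], [0.2, 0.8]]

aligned = contrastive_loss(regions, prompts, targets=[0, 1])
swapped = contrastive_loss(regions, prompts, targets=[1, 0])
print(aligned < swapped)
```

In this toy setup each region lies close to its matching prompt, so the loss with the correct targets is much smaller than the loss with the targets swapped.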