What does CLIP know about peeling a banana?
arxiv(2024)
Abstract
Humans show an innate ability to identify tools that support specific
actions. The association between object parts and the actions they facilitate
is usually called affordance. Being able to segment object parts according to
the tasks they afford is crucial for enabling intelligent robots to use objects
of daily living. Traditional supervised learning methods for affordance
segmentation require costly pixel-level annotations, while weakly supervised
approaches, though less demanding, still rely on object-interaction examples
and support only a closed set of actions. These limitations hinder scalability,
may introduce biases, and usually restrict models to a limited set of
predefined actions. This paper proposes AffordanceCLIP, which overcomes these
limitations by leveraging the implicit affordance knowledge embedded within
large pre-trained Vision-Language models like CLIP. We experimentally
demonstrate that CLIP, although not explicitly trained for affordance
detection, retains valuable information for the task. Our AffordanceCLIP
achieves competitive zero-shot performance compared to methods with specialized
training, while offering several advantages: i) it works with any action
prompt, not just a predefined set; ii) it requires training only a small number
of additional parameters compared to existing solutions; and iii) it eliminates
the need for direct supervision on action-object pairs, opening new
perspectives for functionality-based reasoning in models.
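As a minimal sketch of the underlying idea, the snippet below probes CLIP's zero-shot association between free-form action prompts and an image using the Hugging Face transformers API. The checkpoint name, image file, and prompt wording are illustrative assumptions; this is not the paper's AffordanceCLIP architecture, which additionally trains a small number of parameters to produce pixel-level affordance maps.

```python
# Illustrative probe of CLIP's implicit action-object knowledge.
# NOTE: this is a hypothetical sketch, not the AffordanceCLIP method itself.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("banana.jpg")  # any image of a daily-living object (assumed path)
prompts = ["something to peel", "something to cut with", "something to pour from"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-prompt similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1)
for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{prompt}: {p:.3f}")
```

Ranking such prompt similarities only shows that CLIP encodes some action-object associations at the image level; producing a segmentation of the object part that affords the action requires the additional lightweight components described in the paper.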