Learning to Prompt CLIP for Monocular Depth Estimation: Exploring the Limits of Human Language

2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023

Abstract
CLIP is a vision-and-language pre-training framework that has shown a surprisingly general understanding of the world, performing well on many open-ended tasks with little or no additional training. A recent technique has used CLIP to perform zero-shot Monocular Depth Estimation (MDE) with depth-related prompts, but the use of human language in these prompts introduces an unnecessary human bias. In this work, we replace the discrete human-language words in these prompts with continuous learnable tokens to shed light on the problem. We achieve a significant boost in performance, and find that the learned tokens do not map neatly to depth-related human language, implying that CLIP's concept of depth is not succinctly expressible in human language. We posit that this may extend to other CLIP concepts, and believe that this finding will spark further research into both the use and interpretation of non-linguistic tokens in all open-ended scene interpretation tasks. Code is available at https://github.com/DylanAuty/PromptLearningCLIP-MDE
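As a rough illustration of the idea described in the abstract, the sketch below replaces a set of discrete depth-word prompts with continuous learnable token embeddings, one set per depth bin, and converts patch-to-prompt similarities into depth via a softmax-weighted sum of bin centres. This is a minimal sketch under stated assumptions, not the authors' implementation: the `encode_image` / `encode_text_embeddings` interface, the number of context tokens, the logit scale, and the bin values are all hypothetical.

```python
# Minimal sketch of prompt learning for CLIP-based zero-shot MDE.
# Assumes a frozen CLIP-like model exposing encode_image() (per-patch features)
# and a hypothetical encode_text_embeddings() hook that runs the text
# transformer directly on token embeddings. All names and values below are
# illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnablePromptDepth(nn.Module):
    def __init__(self, clip_model, n_ctx=8, embed_dim=512,
                 bin_centres=(1.0, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0)):
        super().__init__()
        self.clip = clip_model  # frozen CLIP-like backbone (assumed interface)
        self.n_bins = len(bin_centres)
        # One set of continuous, learnable context tokens per depth bin,
        # replacing the discrete depth words ("close", "far", ...) of prior work.
        self.ctx = nn.Parameter(torch.randn(self.n_bins, n_ctx, embed_dim) * 0.02)
        self.register_buffer("bin_centres", torch.tensor(bin_centres))

    def forward(self, image):
        # Per-patch image features, shape (B, P, D); interface is assumed.
        img_feat = F.normalize(self.clip.encode_image(image), dim=-1)
        # Encode each bin's learned prompt into a text feature, shape (n_bins, D).
        txt_feat = F.normalize(self.clip.encode_text_embeddings(self.ctx), dim=-1)
        # Patch-to-prompt similarity -> softmax over bins -> expected depth per patch.
        logits = 100.0 * img_feat @ txt_feat.t()           # (B, P, n_bins)
        probs = logits.softmax(dim=-1)
        depth = (probs * self.bin_centres).sum(dim=-1)     # (B, P)
        return depth
```

In such a setup only `self.ctx` would be trained, with the CLIP backbone kept frozen; the learned tokens can then be compared against the embeddings of real words to ask whether they correspond to depth-related human language, which is the interpretability question the paper investigates.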