Understanding Visual Concepts Across Models
arXiv (2024)
Abstract
Large multimodal models such as Stable Diffusion can generate, detect, and
classify new visual concepts after fine-tuning just a single word embedding. Do
models learn similar words for the same concepts (e.g., does the word for an
orange cat decompose as orange + cat)? We conduct a large-scale analysis on
three state-of-the-art models in text-to-image generation, open-set object
detection, and zero-shot
classification, and find that new word embeddings are model-specific and
non-transferable. Across 4,800 new embeddings trained for 40 diverse visual
concepts on four standard datasets, we find perturbations within an
ε-ball around any prior embedding that generate, detect, and classify an
arbitrary concept. When these new embeddings are spliced into a different
model, the behavior gained by fine-tuning the original model is lost. We show
that popular soft prompt-tuning approaches find these perturbative solutions
when applied to visual concept learning tasks, and that embeddings for visual
concepts are not transferable. Code for reproducing our work is available at:
https://visual-words.github.io.
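The abstract's central observation, that a learned concept embedding can sit within an ε-ball of any prior embedding, amounts to a simple distance check in embedding space. The sketch below illustrates that check with NumPy; the 4-dimensional vectors, the L2 metric, and the threshold value are illustrative assumptions (real text encoders use hundreds of dimensions, and the paper's exact metric and ε are not given here).

```python
import numpy as np

def within_epsilon_ball(learned: np.ndarray, base: np.ndarray, epsilon: float) -> bool:
    """Check whether a fine-tuned word embedding lies inside an
    L2 epsilon-ball centered on a prior (pre-trained) embedding."""
    return float(np.linalg.norm(learned - base)) <= epsilon

# Hypothetical low-dimensional embeddings, purely for illustration.
base = np.array([0.1, -0.2, 0.3, 0.05])
perturbation = 0.01 * np.array([1.0, -1.0, 0.5, 0.2])  # small learned shift
learned = base + perturbation

print(within_epsilon_ball(learned, base, epsilon=0.05))        # small shift
print(within_epsilon_ball(learned + 1.0, base, epsilon=0.05))  # large shift
```

Under this framing, a "new word" that works for one model is just a tiny, model-specific perturbation of an existing embedding, which is why splicing it into another model's embedding table carries no benefit.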