Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning
arXiv (2023)
Abstract
Fine-tuning vision-language models (VLMs) like CLIP to downstream tasks is
often necessary to optimize their performance. However, a major obstacle is the
limited availability of labeled data. We study the use of pseudolabels, i.e.,
heuristic labels for unlabeled data, to enhance CLIP via prompt tuning.
Conventional pseudolabeling trains a model on labeled data and then generates
labels for unlabeled data. VLMs' zero-shot capabilities enable a "second
generation" of pseudolabeling approaches that do not require task-specific
training on labeled data. By using zero-shot pseudolabels as a source of
supervision, we observe that learning paradigms such as semi-supervised,
transductive zero-shot, and unsupervised learning can all be seen as optimizing
the same loss function. This unified view enables the development of versatile
training strategies that are applicable across learning paradigms. We
investigate them on image classification tasks where CLIP exhibits limitations,
by varying prompt modalities, e.g., textual or visual prompts, and learning
paradigms. We find that (1) unexplored prompt tuning strategies that
iteratively refine pseudolabels consistently improve CLIP accuracy, by 19.5
points in semi-supervised learning, by 28.4 points in transductive zero-shot
learning, and by 15.2 points in unsupervised learning, and (2) unlike
conventional semi-supervised pseudolabeling, which exacerbates model biases
toward classes with higher-quality pseudolabels, prompt tuning leads to a more
equitable distribution of per-class accuracy. The code to reproduce the
experiments is at https://github.com/BatsResearch/menghini-neurips23-code.
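The core idea of zero-shot pseudolabeling can be sketched in a few lines. The sketch below is illustrative only: it assumes image and class-prompt embeddings have already been computed (e.g., by CLIP's encoders) and L2-normalized, and it uses a top-K-most-confident-per-class selection rule, which is one common strategy for filtering pseudolabels rather than the paper's exact procedure.

```python
import numpy as np

def zero_shot_pseudolabels(image_feats, text_feats, top_k=2):
    """Assign pseudolabels to unlabeled images from CLIP-style similarities.

    image_feats: (N, D) array of precomputed, L2-normalized image embeddings.
    text_feats:  (C, D) array of L2-normalized class-prompt embeddings.
    Returns (indices, labels): the top_k most confident images per class.
    """
    # Cosine similarity between every image and every class prompt.
    sims = image_feats @ text_feats.T          # (N, C)
    preds = sims.argmax(axis=1)                # zero-shot prediction per image
    conf = sims.max(axis=1)                    # confidence = best similarity

    idx, labels = [], []
    for c in range(text_feats.shape[0]):
        # Among images predicted as class c, keep the top_k most confident.
        members = np.where(preds == c)[0]
        ranked = members[np.argsort(-conf[members])][:top_k]
        idx.extend(ranked.tolist())
        labels.extend([c] * len(ranked))
    return np.array(idx), np.array(labels)
```

Because no task-specific training is needed to produce these labels, the same routine can seed semi-supervised, transductive zero-shot, or fully unsupervised prompt tuning; iterative variants simply re-run it with the tuned prompts' updated embeddings.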