VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness
arXiv (2024)
Abstract
Finetuning a pretrained vision model (PVM) is a common technique for learning
downstream vision tasks. However, the conventional finetuning process with
randomly sampled data points results in diminished training efficiency. To
address this drawback, we propose a novel approach, Vision-language
Collaborative Active Finetuning (VeCAF). With the emerging availability of
labels and natural language annotations of images through web-scale crawling or
controlled generation, VeCAF makes use of this information to perform
parametric data selection for PVM finetuning. VeCAF incorporates the finetuning
objective to select significant data points that effectively guide the PVM
towards faster convergence to meet the performance goal. This process is
assisted by the inherent semantic richness of the text embedding space which we
use to augment image features. Furthermore, the flexibility of text-domain
augmentation allows VeCAF to handle out-of-distribution scenarios without
external data. Extensive experiments show the leading performance and high
computational efficiency of VeCAF that is superior to baselines in both
in-distribution and out-of-distribution image classification tasks. On
ImageNet, VeCAF uses up to 3.3× fewer training batches to reach the target
performance compared to full finetuning, and achieves an accuracy improvement
of 2.7% with the same number of batches.
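The abstract describes two ingredients: objective-aware selection (picking data points that most effectively drive the finetuning objective) and augmenting image features with text embeddings. The following is a minimal, hypothetical sketch of how such a selection step could look; the blending weight `alpha`, the greedy diversity penalty, and all function names are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def select_finetuning_batch(image_embeds, text_embeds, losses, k, alpha=0.5):
    """Illustrative sketch of objective-aware, text-augmented data selection.

    image_embeds : (n, d) array of image features from the pretrained model
    text_embeds  : (n, d) array of text embeddings of each image's caption
    losses       : (n,) per-sample finetuning losses (the "training objective")
    k            : number of samples to select for the next finetuning batch
    """
    # Augment image features with the semantically rich text embeddings
    # (a simple convex blend here; the paper's actual fusion may differ).
    fused = alpha * image_embeds + (1.0 - alpha) * text_embeds

    # Objective awareness: start from the current per-sample loss, so
    # high-loss samples (which most affect convergence) rank first.
    scores = losses.astype(float).copy()

    selected = []
    norms = np.linalg.norm(fused, axis=1) + 1e-8
    for _ in range(k):
        i = int(np.argmax(scores))
        selected.append(i)
        scores[i] = -np.inf  # never pick the same sample twice
        # Greedy diversity penalty: down-weight samples whose fused
        # embedding is similar to the one just selected.
        sims = (fused @ fused[i]) / (norms * norms[i])
        scores = scores - 0.1 * np.maximum(sims, 0.0)
    return selected
```

Under these assumptions the first pick is simply the highest-loss sample, while later picks trade loss against redundancy with already-selected points; a parametric selection model, as VeCAF proposes, would replace this greedy heuristic.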