ADVL: Adaptive Distillation for Vision-Language Tasks

ICLR 2023 (2023)

Abstract
Large-scale image-text pairs, such as image-captions and image-phrases, enable strong representations in vision-language (VL) models. Nevertheless, such pairs lose diversity and complexity due to constraints in the data collection process. Meanwhile, models pre-trained with image-only or text-only data (we call them unimodal pretrained models) continue to flourish and impress the community. Compared to image-text pairs, unimodal data has fewer constraints during collection, resulting in more diverse styles. A natural question is how to leverage unimodal pretrained models to benefit downstream VL tasks. Most existing works focus on fusing VL information in the expensive pre-training stage: they directly plug unimodal pre-trained encoders into a VL framework and redo an additional pre-training step on paired image-text data. This incurs additional computation expense, and the unimodal pretrained knowledge might be forgotten. In this paper, we take a different route and investigate how to fuse VL information in the finetuning stage only. To directly transfer pretrained knowledge from unimodal models to help downstream VL tasks, we propose ADVL, which avoids redoing any pre-training step and is generalizable enough to be applied on top of various VL models. To comprehensively demonstrate the effectiveness of ADVL, we conduct evaluation across three widely recognized, highly semantic VL benchmarks: VCR, VQA, and SNLI-VE, under three settings: low-shot, full-shot, and domain-shifted. Results show that ADVL consistently improves performance with different VL base models across all settings. It even achieves state-of-the-art (SOTA) performance on VCR among models pre-trained with image-text data and delivers competitive results on VQA and SNLI-VE. Based on our analysis, we also discover that ADVL can improve the robustness of VL models and regulate them to better use vision information.
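The abstract does not spell out the mechanics of "fusing VL information in the finetuning stage" via distillation, so the following is only a minimal sketch of the general idea, not the paper's actual ADVL formulation. The model interfaces, feature names, and loss weighting (`vl_model`, `unimodal_teacher`, `alpha`) are hypothetical placeholders, assuming a frozen unimodal teacher supervising the VL model's features alongside the downstream task loss.

```python
import torch
import torch.nn.functional as F

def finetune_step(vl_model, unimodal_teacher, batch, optimizer, alpha=0.5):
    """One finetuning step combining the task loss with a distillation term.

    All names here are illustrative assumptions; the paper's ADVL method is
    not specified in detail by the abstract.
    """
    images, texts, labels = batch

    # Task prediction from the VL base model (e.g., answer logits for VQA),
    # along with intermediate multimodal features.
    logits, vl_features = vl_model(images, texts)
    task_loss = F.cross_entropy(logits, labels)

    # Frozen unimodal teacher (e.g., an image-only pretrained encoder)
    # provides target representations without any extra pre-training step.
    with torch.no_grad():
        teacher_features = unimodal_teacher(images)

    # Distillation term: align the VL model's features with the teacher's.
    distill_loss = F.mse_loss(vl_features, teacher_features)

    loss = task_loss + alpha * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the distillation signal enters only through the finetuning loss, this kind of setup leaves the pre-trained weights of both models untouched beforehand and can, in principle, be layered on top of different VL base models.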
Keywords
vision language, knowledge distillation, vcr, vqa, snli-ve, visual question answering, commonsense reasoning, pretraining, multimodal, robust, low-shot, zero-shot, domain-shift, debiased