Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models
CoRR (2024)
Abstract
Adapter-based parameter-efficient transfer learning has achieved promising
results in vision-language models. Conventional adapter methods typically
require training or fine-tuning, which becomes difficult when samples or
computational resources are limited. Some methods avoid training by building a
cache of image features and retrieving from it, but they overlook the text
modality and the cross-modal cues that matter for parameter-efficient
adaptation of vision-language models. This work introduces a cross-modal
parameter-efficient approach named XMAdapter. XMAdapter builds cache models
for both the text and image modalities and retrieves from both to gather cues
for inference. By dynamically adjusting an affinity ratio, it fuses the two
modalities while decoupling their similarities to assess each modality's
contribution. It further mines hard samples based on the difference between
the cross-modal affinities and adaptively adjusts their learning intensity to
strengthen the model. Extensive experiments on benchmark datasets demonstrate
that XMAdapter significantly outperforms previous adapter-based methods in
accuracy, generalization, and efficiency.
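For intuition, the sketch below illustrates what a dual-modality cache lookup of this kind might look like, in the spirit of training-free cache adapters such as Tip-Adapter. It is a minimal, hypothetical example: the function name, tensor shapes, and the hyperparameters `alpha`, `beta`, and `lam` are illustrative assumptions rather than the paper's exact formulation, and the affinity ratio is a fixed scalar here whereas the paper adjusts it dynamically.

```python
import torch

def xmadapter_logits(test_img_feat, img_cache_keys, txt_cache_keys,
                     cache_values, clip_text_weights,
                     alpha=1.0, beta=5.5, lam=0.5):
    """Hypothetical dual-cache lookup (not the paper's exact method).

    test_img_feat:     (N, D) L2-normalized image features of test samples
    img_cache_keys:    (M, D) cached image features of few-shot training samples
    txt_cache_keys:    (M, D) cached text features paired with the same samples
    cache_values:      (M, C) one-hot labels of the cached samples
    clip_text_weights: (D, C) zero-shot classifier built from class-name prompts
    """
    # Affinity of each test image with every cached sample, per modality.
    img_affinity = test_img_feat @ img_cache_keys.t()   # (N, M)
    txt_affinity = test_img_feat @ txt_cache_keys.t()   # (N, M)

    # Blend the two affinities with a ratio lam (fixed here for simplicity;
    # the paper adjusts this ratio dynamically).
    fused_affinity = lam * img_affinity + (1.0 - lam) * txt_affinity

    # Sharpen the fused affinities and read out cached labels.
    cache_logits = torch.exp(-beta * (1.0 - fused_affinity)) @ cache_values  # (N, C)

    # Combine with the zero-shot prediction from the text classifier.
    clip_logits = 100.0 * test_img_feat @ clip_text_weights                  # (N, C)
    return clip_logits + alpha * cache_logits
```

Under these assumptions, the gap between `img_affinity` and `txt_affinity` for a given sample could also serve as a hard-sample signal, since a large disagreement between the two modalities suggests the sample deserves more learning weight.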