M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition
CoRR (2024)
Abstract
Recently, the rise of large-scale vision-language pretrained models such as
CLIP, coupled with Parameter-Efficient Fine-Tuning (PEFT) techniques, has
attracted substantial attention in video action recognition. Nevertheless,
prevailing approaches tend to prioritize strong supervised performance at the
expense of the models' generalization capabilities during transfer. In this
paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework,
named M2-CLIP, to address these challenges, preserving both
high supervised performance and robust transferability. Firstly, to enhance the
individual modality architectures, we introduce multimodal adapters to both the
visual and text branches. Specifically, we design a novel visual TED-Adapter
that performs global Temporal Enhancement and local temporal Difference
modeling to improve the temporal representation capabilities of the visual
encoder. Moreover, we adopt text encoder adapters to strengthen the learning of
semantic label information. Secondly, we design a multi-task decoder with a
rich set of supervisory signals to adeptly satisfy the need for strong
supervised performance and generalization within a multimodal framework.
Experimental results validate the efficacy of our approach, demonstrating
exceptional performance in supervised learning while maintaining strong
generalization in zero-shot scenarios.
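
To make the TED-Adapter idea concrete, below is a minimal PyTorch-style sketch of an adapter that combines global temporal enhancement (frames attending to all other frames) with local temporal difference modeling (adjacent-frame changes). All layer choices, names, and shapes here are illustrative assumptions inferred from the abstract, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class TEDAdapter(nn.Module):
    """Hypothetical sketch of a TED-style adapter: global Temporal
    Enhancement plus local temporal Difference modeling on frame features.
    Dimensions and layer choices are assumptions, not the paper's design."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        # Bottleneck projections, standard in adapter designs.
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Global temporal enhancement: lightweight single-head attention
        # over the temporal axis, so every frame sees the whole clip.
        self.temporal_attn = nn.MultiheadAttention(
            bottleneck, num_heads=1, batch_first=True)
        # Local temporal difference: depth-wise 1-D conv over frames.
        self.diff_conv = nn.Conv1d(
            bottleneck, bottleneck, kernel_size=3, padding=1,
            groups=bottleneck)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame features from the visual encoder.
        h = self.act(self.down(x))
        # Global enhancement: each frame attends to all frames.
        g, _ = self.temporal_attn(h, h, h)
        # Local difference: model changes between adjacent frames.
        d = h - torch.roll(h, shifts=1, dims=1)
        d = self.diff_conv(d.transpose(1, 2)).transpose(1, 2)
        # Residual adapter output back at the encoder width.
        return x + self.up(g + d)
```

In this sketch the adapter operates on pooled per-frame features; in practice, such adapters are typically inserted inside each transformer block of the visual encoder, which is likely where a design like this would live.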