When CLIP meets cross-modal hashing retrieval: A new strong baseline

Inf. Fusion (2023)

Abstract
Contrastive Language-Image Pre-training (CLIP), a large-scale multi-modal model that learns visual representations from natural language supervision, has recently driven significant progress on a variety of multi-modal tasks. However, its potential effect on cross-modal hashing retrieval has not yet been investigated. In this paper, we explore for the first time the effect of CLIP on cross-modal hashing retrieval performance and propose a simple but strong baseline, the Unsupervised Contrastive Multi-modal Fusion Hashing network (UCMFH). We first extract off-the-shelf visual and linguistic features from the CLIP model as the input sources for the cross-modal hashing functions. To further mitigate the semantic gap between image and text features, we design an effective contrastive multi-modal learning module that leverages a multi-modal fusion transformer encoder supervised by a contrastive loss, enhancing modality interaction while improving the semantic representation of each modality. Furthermore, we design a contrastive hash learning module to produce high-quality, modality-correlated hash codes. Experiments show that our simple new unsupervised baseline UCMFH achieves significant performance improvements over state-of-the-art supervised and unsupervised cross-modal hashing methods. Our experiments also demonstrate the remarkable performance of CLIP features on the cross-modal hashing retrieval task compared to the deep visual and linguistic features used in existing state-of-the-art methods. The source code for our approach is publicly available at: https://github.com/XinyuXia97/UCMFH.
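
At a high level, the pipeline described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' implementation (see the linked repository for that): it assumes precomputed 512-dimensional CLIP image/text embeddings, a small hypothetical fusion transformer, hash heads with a tanh relaxation, and a symmetric InfoNCE-style contrastive loss. All class and function names below are invented for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FusionHashNet(nn.Module):
        """Illustrative sketch: fuse CLIP image/text features with a small
        transformer encoder, then map each modality to K-bit hash codes."""
        def __init__(self, feat_dim=512, num_bits=64, num_layers=2, num_heads=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                               batch_first=True)
            self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
            # Separate hash heads per modality; tanh gives relaxed codes in (-1, 1).
            self.img_hash = nn.Sequential(nn.Linear(feat_dim, num_bits), nn.Tanh())
            self.txt_hash = nn.Sequential(nn.Linear(feat_dim, num_bits), nn.Tanh())

        def forward(self, img_feat, txt_feat):
            # Treat the two modalities as a length-2 token sequence so the
            # transformer can exchange information between them.
            tokens = torch.stack([img_feat, txt_feat], dim=1)   # (B, 2, D)
            fused = self.fusion(tokens)
            h_img = self.img_hash(fused[:, 0])                  # (B, K)
            h_txt = self.txt_hash(fused[:, 1])
            return h_img, h_txt

    def contrastive_loss(h_img, h_txt, temperature=0.07):
        """Symmetric InfoNCE-style loss: matched image/text pairs are positives,
        all other pairs in the batch serve as negatives."""
        z_i = F.normalize(h_img, dim=-1)
        z_t = F.normalize(h_txt, dim=-1)
        logits = z_i @ z_t.t() / temperature                    # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Usage with stand-ins for precomputed CLIP features (e.g. from ViT-B/32).
    model = FusionHashNet(feat_dim=512, num_bits=64)
    img_feat = torch.randn(8, 512)   # placeholder for CLIP image embeddings
    txt_feat = torch.randn(8, 512)   # placeholder for CLIP text embeddings
    h_img, h_txt = model(img_feat, txt_feat)
    loss = contrastive_loss(h_img, h_txt)
    binary_codes = torch.sign(h_img)  # binarize the relaxed codes at retrieval time

At retrieval time, the relaxed codes would be binarized (e.g. with the sign function as above) and cross-modal retrieval performed by Hamming distance between image and text codes.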
Keywords
Cross-modal retrieval, Hashing, CLIP, Modality fusion, Contrastive learning