
BagFormer: Better cross-modal retrieval via bag-wise interaction

Haowen Hou, Xiaopeng Yan, Yigeng Zhang

Engineering Applications of Artificial Intelligence (2024)

Abstract
In the field of cross-modal retrieval, single-encoder models tend to perform better than dual-encoder models, but they suffer from high latency and low throughput. In this paper, we propose a dual-encoder model called BagFormer that utilizes a bag-wise late interaction mechanism to improve re-ranking performance without sacrificing latency or throughput. BagFormer achieves this by employing a bagging layer, which transforms the text into an appropriate granularity. This not only mitigates the issue of modal granularity mismatch but also enables the integration of entity knowledge into the model. Our experiments show that BagFormer (ViT-B) outperforms the traditional dual-encoder model CLIP (ViT-B) by 7.97% in zero-shot settings. Under fine-tuned conditions, BagFormer (ViT-B) demonstrates an even more significant improvement of 17.98% over CLIP (ViT-B). Moreover, BagFormer not only matches the performance of cutting-edge single-encoder models on cross-modal retrieval tasks but also provides efficient inference, characterized by lower latency and higher throughput. Compared to single-encoder models, BagFormer achieves a speedup ratio of 38.14 when re-ranking individual candidates. Code and models are available at github.com/howard-hou/BagFormer.
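The bag-wise late interaction described above can be pictured as a late-interaction (MaxSim-style) match computed over pooled text "bags" rather than individual tokens. The sketch below is an illustrative assumption rather than the paper's implementation: the bagging assignment (`bag_ids`), the mean pooling, and the max-over-patches aggregation are hypothetical choices made for clarity; see the authors' repository for the actual code.

```python
# Minimal sketch of bag-wise late interaction scoring (assumed design, not the
# official BagFormer code). Text tokens are grouped into bags, mean-pooled into
# bag embeddings, then each bag is matched against its best image patch.
import torch
import torch.nn.functional as F

def bag_pool(token_emb: torch.Tensor, bag_ids: torch.Tensor, num_bags: int) -> torch.Tensor:
    """Mean-pool token embeddings into bag embeddings.

    token_emb: (num_tokens, dim) text token embeddings
    bag_ids:   (num_tokens,) bag index assigned to each token
    """
    dim = token_emb.size(-1)
    bags = torch.zeros(num_bags, dim, dtype=token_emb.dtype)
    counts = torch.zeros(num_bags, 1, dtype=token_emb.dtype)
    bags.index_add_(0, bag_ids, token_emb)
    counts.index_add_(0, bag_ids, torch.ones(token_emb.size(0), 1, dtype=token_emb.dtype))
    return bags / counts.clamp(min=1.0)

def bag_wise_score(text_tokens, bag_ids, num_bags, image_patches) -> torch.Tensor:
    """Late-interaction score: each text bag is matched to its most similar image patch."""
    bags = F.normalize(bag_pool(text_tokens, bag_ids, num_bags), dim=-1)   # (num_bags, dim)
    patches = F.normalize(image_patches, dim=-1)                           # (num_patches, dim)
    sim = bags @ patches.t()                                               # (num_bags, num_patches)
    return sim.max(dim=-1).values.sum()                                    # MaxSim per bag, summed over bags

# Toy usage: 5 tokens grouped into 2 bags, 9 image patches, 64-dim embeddings.
tokens = torch.randn(5, 64)
bag_ids = torch.tensor([0, 0, 1, 1, 1])
patches = torch.randn(9, 64)
print(bag_wise_score(tokens, bag_ids, 2, patches))
```

Because the image side is encoded once and only this lightweight bag-to-patch matching runs per candidate, such a scheme can re-rank far faster than a single-encoder model that must jointly re-encode every image-text pair.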
Key words
Cross-modal retrieval, Vision-language pre-training, Information retrieval, Image-to-text retrieval, Multi-modal learning