CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
arXiv (2024)
Abstract
Text-to-image retrieval aims to find images relevant to a text
query, which is important in use cases such as digital libraries,
e-commerce, and multimedia databases. Although Multimodal Large Language Models
(MLLMs) demonstrate state-of-the-art performance, they exhibit limitations in
handling large-scale, diverse, and ambiguous real-world needs of retrieval, due
to the computation cost and the injective embeddings they produce. This paper
presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework,
designed for fast and effective large-scale long-text to image retrieval. The
first stage, Entity-based Ranking (ER), adapts to long-text query ambiguity by
employing a multiple-queries-to-multiple-targets paradigm, facilitating
candidate filtering for the next stage. The second stage, Summary-based
Re-ranking (SR), refines these rankings using summarized queries. We also
propose a specialized Decoupling-BEiT-3 encoder, optimized for handling
ambiguous user needs and for serving both stages, which also improves
computational efficiency through vector-based similarity inference. Evaluation on the AToMiC
dataset reveals that CFIR surpasses existing MLLMs by up to 11.06% in
Recall@1000, while reducing training time by 68.75% and also reducing
retrieval time. We will release our code to facilitate future research at
https://github.com/longkukuhi/CFIR.
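
The coarse-to-fine flow described above can be illustrated with a minimal sketch. This is not the released CFIR code: the function names, the use of a max over entity queries to aggregate stage-one scores, and the random vectors standing in for Decoupling-BEiT-3 embeddings are all assumptions made for illustration. The sketch only shows the general idea that both stages score against one shared index of precomputed image vectors.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def coarse_to_fine_search(image_index, entity_queries, summary_query,
                          k_coarse=100, k_final=10):
    """Hypothetical two-stage search over one shared image index.

    image_index:    (num_images, dim) precomputed image embeddings
    entity_queries: (num_entities, dim) one embedding per query entity
    summary_query:  (dim,) embedding of the summarized long-text query
    """
    # Stage 1 (coarse): score every image against every entity query and
    # keep, per image, its best-matching entity (an assumed aggregation).
    sims = entity_queries @ image_index.T          # (num_entities, num_images)
    coarse_scores = sims.max(axis=0)               # (num_images,)
    k_coarse = min(k_coarse, image_index.shape[0])
    candidates = np.argpartition(-coarse_scores, k_coarse - 1)[:k_coarse]

    # Stage 2 (fine): re-rank only the candidates with the summary query,
    # reusing the same index vectors rather than re-encoding images.
    fine_scores = image_index[candidates] @ summary_query
    order = np.argsort(-fine_scores)[:k_final]
    return candidates[order], fine_scores[order]


# Toy usage with random placeholder embeddings (not real model outputs).
rng = np.random.default_rng(0)
image_index = l2_normalize(rng.normal(size=(1000, 64)))
entity_queries = l2_normalize(image_index[42] + 0.05 * rng.normal(size=(3, 64)))
summary_query = l2_normalize(image_index[42] + 0.05 * rng.normal(size=64))
top_ids, top_scores = coarse_to_fine_search(image_index, entity_queries, summary_query)
print(top_ids[:3], top_scores[:3])
```

Because the second stage touches only `k_coarse` vectors, its cost is independent of corpus size; the coarse stage is a single matrix product over the shared index.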