GilBERT: Generative Vision-Language Pre-Training for Image-Text Retrieval

IR (2021)

Abstract
Given a text or image query, image-text retrieval aims to find the relevant items in a database. Recently, visual-linguistic pre-training (VLP) methods have demonstrated promising accuracy on image-text retrieval and other visual-linguistic tasks. These VLP methods are typically pre-trained on a large number of image-text pairs and then fine-tuned on various downstream tasks. Nevertheless, due to the natural modality incompleteness in image-text retrieval, i.e., the query is either an image or a text rather than an image-text pair, naively applying VLP to image-text retrieval results in significant inefficiency. Moreover, existing VLP methods cannot extract comparable representations for a single-modal query and multi-modal database items. In this work, we propose a generative visual-linguistic pre-training approach, termed GilBERT, that simultaneously learns generic representations of image-text data and completes the missing modality for incomplete pairs. In the testing phase, GilBERT facilitates efficient vector-based retrieval by providing a unified feature embedding for both queries and database items. Moreover, the generative training not only makes GilBERT compatible with non-parallel text/image corpora, but also enables GilBERT to model image-text relationships without relying on massive numbers of randomly sampled negative samples, leading to superior experimental performance. Extensive experiments demonstrate the advantages of GilBERT in image-text retrieval, in terms of both efficiency and accuracy.
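To make the retrieval setup concrete, the following is a minimal Python sketch of the vector-based retrieval scheme the abstract describes: a single encoder maps both a single-modal query and the database items into one embedding space, so retrieval reduces to nearest-neighbor search over normalized vectors. The encoder here (embed), the dimensions (FEAT, DIM), and the data are stand-in assumptions for illustration, not GilBERT's actual architecture.

    import numpy as np

    rng = np.random.default_rng(0)
    FEAT = 512  # hypothetical raw feature size (image or text features)
    DIM = 256   # hypothetical shared embedding size

    # Stand-in "unified encoder": a fixed linear map plus L2 normalization.
    # In the paper's setting this role is played by the pre-trained model,
    # which produces one comparable embedding for a text query, an image
    # query, or a (possibly completed) image-text database item.
    W = rng.standard_normal((FEAT, DIM)) / np.sqrt(FEAT)

    def embed(x: np.ndarray) -> np.ndarray:
        z = x @ W
        return z / np.linalg.norm(z, axis=-1, keepdims=True)

    # Offline: embed all database items once (1000 fake items here).
    db = embed(rng.standard_normal((1000, FEAT)))

    # Online: embed the single-modal query and rank by cosine similarity,
    # which is a plain dot product after L2 normalization.
    query = embed(rng.standard_normal((1, FEAT)))
    scores = (query @ db.T).ravel()
    top5 = np.argsort(-scores)[:5]
    print("top-5 item ids:", top5)

Because the database embeddings are computed once offline, each query costs a single encoder pass plus a vector search, which is the efficiency advantage over VLP models that must jointly score every query-item pair at test time.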
Keywords
Image-text retrieval, Visual-linguistic pre-training