
ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion

2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2023

Abstract
Traditional image-to-image and text-to-image search methods struggle to capture complex user intent, particularly in fashion e-commerce, where users look for products similar to a reference image but with modifications described in text. This paper introduces Progressive Vision-Language Alignment and Multimodal Fusion (ProVLA), a novel approach that uses a transformer-based vision and language model to generate multimodal embeddings. Our method involves a two-step learning process and a cross-attention-based fusion encoder to facilitate robust information fusion, together with a momentum queue-based hard negative mining mechanism to handle noisy training data. Extensive evaluations on the Fashion 200k and Shoes benchmark datasets demonstrate that our model outperforms state-of-the-art methods.
Keywords
Content-Based Image Retrieval, Compositional Image Retrieval, Multimodal Learning, Vision Language Model
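
The abstract mentions a momentum queue-based hard negative mining mechanism but gives no details. The following is a minimal sketch of the general idea only, assuming a MoCo-style setup: a FIFO queue of target embeddings from earlier batches, an exponential-moving-average (momentum) update of a second encoder's parameters, and an InfoNCE-style contrastive loss computed against the queue entries most similar to the query. The names `NegativeQueue`, `ema_update`, `info_nce`, the temperature value, and the top-k selection rule are all hypothetical illustrations and are not taken from the paper.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def ema_update(momentum_params, online_params, m=0.99):
    """Momentum (EMA) update: slow-moving copy of the online parameters."""
    return [m * k + (1.0 - m) * q for k, q in zip(momentum_params, online_params)]


class NegativeQueue:
    """FIFO store of L2-normalized target embeddings from earlier batches."""

    def __init__(self, max_size):
        # deque(maxlen=...) silently drops the oldest entries when full.
        self.items = deque(maxlen=max_size)

    def enqueue(self, embeddings):
        for e in embeddings:
            self.items.append(normalize(e))

    def hardest(self, query, k):
        """Return the k queued embeddings most similar to the query (assumed rule)."""
        q = normalize(query)
        ranked = sorted(self.items, key=lambda e: dot(q, e), reverse=True)
        return ranked[:k]


def info_nce(query, positive, negatives, temperature=0.07):
    """-log softmax score of the positive among the positive and the negatives."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / temperature]
    logits += [dot(q, normalize(n)) / temperature for n in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denom - logits[0]


# Toy 2-D example: the query is a fused (reference image + text) embedding,
# the positive is the target image embedding, the queue holds older targets.
queue = NegativeQueue(max_size=4)
queue.enqueue([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]])
query, positive = [1.0, 0.2], [1.0, 0.3]

hard = queue.hardest(query, k=2)
loss_hard = info_nce(query, positive, hard)
loss_all = info_nce(query, positive, list(queue.items))
```

In this sketch, restricting the loss to the top-k queue entries concentrates the training signal on negatives that are close to the query. How the paper selects negatives, and how it uses the queue to cope with noisy labels, cannot be determined from the abstract.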