FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback

IEEE Conference on Computer Vision and Pattern Recognition (2022)

Cited 46 | Viewed 53
Abstract
Fashion image retrieval based on a query pair of a reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from the visual and textual modalities simultaneously. We propose a new vision-language transformer-based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.
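The abstract mentions an attention-based approach that fuses target image features drawn from multiple levels of context, without passing them through text or transformer layers. A minimal sketch of such a fusion, assuming a simple learned query vector scores each feature level (the function names and shapes here are illustrative, not the paper's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(levels, query):
    """Hypothetical sketch: fuse multi-level image features into one
    vector using attention weights derived from a learned query.

    levels : list of (d,) feature vectors, one per context level
    query  : (d,) learned query vector
    returns (fused_feature, attention_weights)
    """
    feats = np.stack(levels)          # (n_levels, d)
    scores = feats @ query            # one scalar score per level
    weights = softmax(scores)         # attention distribution over levels
    fused = weights @ feats           # weighted sum -> (d,)
    return fused, weights

# Toy usage: three feature levels of dimension 4
rng = np.random.default_rng(0)
levels = [rng.standard_normal(4) for _ in range(3)]
query = rng.standard_normal(4)
fused, weights = attention_fuse(levels, query)
```

Note that this fusion involves no text input and no transformer layers, mirroring the asymmetric design the abstract describes for the target-image branch.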
Keywords
Recognition: detection, categorization, retrieval; Vision + language; Visual reasoning