DEMO: A Statistical Perspective for Efficient Image-Text Matching
CoRR (2024)
Abstract
Image-text matching has been a long-standing problem, which seeks to connect
vision and language through semantic understanding. Because they can handle
large-scale raw data, unsupervised hashing-based approaches have recently gained
prominence. They typically construct a semantic similarity structure
using the natural distance, which then guides the model
optimization process. However, the similarity structure could be biased at the
boundaries of semantic distributions, causing error accumulation during
sequential optimization. To tackle this, we introduce a novel hashing approach
termed Distribution-based Structure Mining with Consistency Learning (DEMO) for
efficient image-text matching. From a statistical view, DEMO characterizes each
image using multiple augmented views, which are considered as samples drawn
from its intrinsic semantic distribution. Then, we employ a non-parametric
distribution divergence to ensure a robust and precise similarity structure. In
addition, we introduce collaborative consistency learning, which not only
preserves the similarity structure in the Hamming space but also encourages
consistency between the retrieval distributions from different directions in a
self-supervised manner. Through extensive experiments on three benchmark
image-text matching datasets, we demonstrate that DEMO achieves superior
performance compared with many state-of-the-art methods.
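
The idea of treating augmented views as samples from an image's semantic distribution, and comparing images through a non-parametric divergence, can be illustrated with a minimal sketch. The abstract does not name the divergence DEMO uses, so the energy distance below is an assumption chosen for simplicity, and the function names (`pairwise_dist`, `energy_distance`, `similarity_structure`), the `exp(-d)` mapping to a similarity, and the synthetic features are all hypothetical.

```python
import numpy as np


def pairwise_dist(a, b):
    # Euclidean distance between every row of a and every row of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_distance(x, y):
    # Energy distance between two sample sets (assumed divergence, not
    # necessarily the one used in the paper):
    #   2 E|X - Y| - E|X - X'| - E|Y - Y'|
    # Means are taken over all pairs, including identical ones.
    return (2 * pairwise_dist(x, y).mean()
            - pairwise_dist(x, x).mean()
            - pairwise_dist(y, y).mean())


def similarity_structure(views):
    # views has shape (n_images, n_views, dim): each image is represented
    # by the features of its augmented views.
    n = views.shape[0]
    div = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            div[i, j] = div[j, i] = energy_distance(views[i], views[j])
    # Map divergences to similarities; exp(-d) is an illustrative choice.
    return np.exp(-div)


# Synthetic stand-in for view features: images 0 and 1 are semantically
# close, image 2 is far from both.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
views = centers[:, None, :] + 0.05 * rng.standard_normal((3, 8, 2))
S = similarity_structure(views)
print(np.round(S, 3))
```

Because the divergence is computed between sets of views rather than between single feature vectors, one atypical view has less influence on the resulting similarity, which is the kind of robustness the abstract argues for.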