Deep Top-$k$ Ranking for Image–Sentence Matching
IEEE Transactions on Multimedia (2020)
Abstract
Image–sentence matching is a challenging task because of the heterogeneity gap between different modalities. Ranking-based methods have achieved excellent performance on this task over the past decades. Given an image query, these methods typically assume that the correctly matched image–sentence pair must rank before all mismatched ones. However, this assumption may be too strict and prone to overfitting, especially when some sentences in a massive database are similar to, and easily confused with, one another. In this paper, we relax the traditional ranking loss and propose a novel deep multi-modal network with a top-$k$ ranking loss to mitigate this data ambiguity problem. With this strategy, a query result is not penalized unless the rank of the ground truth falls outside the top-$k$ query results. Because the original top-$k$ ranking loss is non-smooth and non-convex, we approximate it with a tight convex upper bound and then optimize the deep multi-modal network with the standard back-propagation algorithm. Finally, we apply the method to three benchmark datasets, namely Flickr8k, Flickr30k, and MSCOCO. Empirical results on the R@K metrics (K = 1, 5, 10) show that our method achieves performance comparable to state-of-the-art methods.
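The relaxation described in the abstract can be illustrated with a small sketch that compares the strict ranking loss, the original top-$k$ 0/1 loss, and a convex surrogate for it. This is not the authors' implementation. The abstract does not say which convex upper bound the paper uses, so the surrogate here is an assumption: the averaged top-$k$ hinge loss known from top-$k$ multiclass SVMs. The margin of 1, the pessimistic tie handling, the single ground-truth sentence per query, and all function names are likewise hypothetical choices made for illustration.

```python
from typing import List, Sequence


def rank_of_ground_truth(scores: Sequence[float], gt: int) -> int:
    """1-based rank of the ground-truth sentence among all candidates.

    Ties are counted against the ground truth (a pessimistic assumption).
    """
    return 1 + sum(1 for j, s in enumerate(scores) if j != gt and s >= scores[gt])


def top_k_01_loss(scores: Sequence[float], gt: int, k: int) -> float:
    """Original top-k loss: no penalty unless the ground truth falls outside the top k.

    Piecewise constant in the scores, hence non-smooth and non-convex.
    """
    return 0.0 if rank_of_ground_truth(scores, gt) <= k else 1.0


def hinge_ranking_loss(scores: Sequence[float], gt: int, margin: float = 1.0) -> float:
    """Strict ranking loss: every mismatched sentence must score at least
    `margin` below the matched one (sum of hinge violations)."""
    return sum(max(0.0, margin + s - scores[gt]) for j, s in enumerate(scores) if j != gt)


def top_k_hinge_loss(scores: Sequence[float], gt: int, k: int, margin: float = 1.0) -> float:
    """Assumed convex surrogate (averaged top-k hinge), not necessarily the paper's bound.

    a_j = margin * [j != gt] + s_j - s_gt is affine in the scores; the mean of the
    k largest a_j is convex, and clipping at 0 keeps it convex. With margin >= 1 it
    upper-bounds top_k_01_loss: if the ground truth is outside the top k, at least
    k entries satisfy a_j >= margin.
    """
    a = [(0.0 if j == gt else margin) + s - scores[gt] for j, s in enumerate(scores)]
    largest_k = sorted(a, reverse=True)[:k]
    return max(0.0, sum(largest_k) / k)


def recall_at_k(score_rows: List[Sequence[float]], gts: Sequence[int], k: int) -> float:
    """R@K: fraction of queries whose (single, assumed) ground truth is in the top K."""
    hits = sum(1 for scores, gt in zip(score_rows, gts) if rank_of_ground_truth(scores, gt) <= k)
    return hits / len(gts)


if __name__ == "__main__":
    # One image query scored against four candidate sentences; index 2 is the match.
    scores = [0.2, 0.9, 0.5, -1.0]
    print("rank:", rank_of_ground_truth(scores, 2))
    for k in (1, 2, 3):
        print(k, top_k_01_loss(scores, 2, k), round(top_k_hinge_loss(scores, 2, k), 3))
    print("strict hinge:", round(hinge_ranking_loss(scores, 2), 3))
```

In this toy query, the ground truth ranks second. The top-1 0/1 loss therefore penalizes it, but the top-2 loss does not. With k = 1 the assumed surrogate reduces to the hardest-negative hinge max(0, max_j a_j). Larger k weights each of the k largest violations by 1/k, so a single confusable sentence dominates the penalty less. R@K equals one minus the mean top-K 0/1 loss over queries, which ties this kind of objective directly to the reported metric.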
Keywords
Task analysis,Bidirectional control,Databases,Training,Deep learning,Sports,Semantics