ZoomViT: an observation behavior-based fine-grained recognition scheme

Zhipeng Ma, Yongquan Yang, Haicheng Wang, Lei Huang, Zhiqiang Wei

Neural Computing and Applications (2024)

Abstract
Fine-grained image recognition aims to distinguish images that differ only subtly and to identify the sub-categories to which they belong. Recently, the vision transformer (ViT) has achieved promising results in many computer vision tasks. In this paper, we introduce human observation behavior into ViT and propose a novel transformer-based network, named ZoomViT. We divide fine-grained recognition into two steps: "look closer" and "contrast." First, looking closer means observing finer local regions and multi-scale features while avoiding the adverse effect of the background on recognition. We design the zoom-in module to track the attention flow by integrating the attention weights, so as to zoom in on the discriminative foreground regions. Moreover, the straightforward image splitting used in ViT may harm recognition. Therefore, we design the zoom-out module, which combines overlapping cutting and downsampling to maintain both the integrity of local neighboring structures and the running efficiency of the model. Finally, we propose to contrast the features of known sub-categories to supervise the model in learning the subtle differences among sub-categories. The consistency of features extracted from different batches increases over time; for this reason, we propose a variable-length queue that stores features from different batches so that contrastive learning is conducted efficiently and thoroughly. We experimentally demonstrate the state-of-the-art performance of our model on four popular fine-grained benchmarks: CUB-200-2011, Stanford Dogs, NABirds, and iNat2017.
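The "overlapping cutting" mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names overlapping_patches and num_patches, the 8x8 toy image, and the patch and stride values are all assumptions chosen for illustration. The sketch only shows the general idea that a stride smaller than the patch size makes neighboring patches share pixels, whereas a stride equal to the patch size gives the plain, non-overlapping split.

```python
def overlapping_patches(image, patch, stride):
    """Split a 2D grid (list of rows) into patch x patch windows.

    Hypothetical helper for illustration only. A stride smaller than
    `patch` makes neighboring windows overlap; stride == patch gives
    the plain non-overlapping split.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            window = [row[left:left + patch] for row in image[top:top + patch]]
            patches.append(window)
    return patches


def num_patches(size, patch, stride):
    """Number of windows along one axis of length `size`."""
    return (size - patch) // stride + 1


# Toy 8x8 "image" whose pixel value encodes its position (row * 8 + col).
image = [[r * 8 + c for c in range(8)] for r in range(8)]

plain = overlapping_patches(image, patch=4, stride=4)    # no overlap
overlap = overlapping_patches(image, patch=4, stride=2)  # 2-pixel overlap

print(len(plain), len(overlap))
```

With these assumed values, the overlapping split yields more patches than the plain split, which increases the token count a transformer has to process. That cost is one plausible reason for pairing overlapping cutting with downsampling, as the abstract describes.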
Keywords
Fine-grained image recognition, Image classification, Visual attention, Local region feature, Discriminative foreground, Observation behavior