
Multi-level information fusion Transformer with background filter for fine-grained image recognition

Ying Yu, Jinghui Wang, Witold Pedrycz, Duoqian Miao, Jin Qian

Applied Intelligence (2024)

Abstract
Compared to traditional image recognition, Fine-Grained Image Recognition (FGIR) faces significant challenges due to the subtle distinctions among different categories and the notable variances within the same category. Furthermore, complex backgrounds and the fact that discriminative features are confined to small local regions further exacerbate the difficulty. Recently, several studies have demonstrated the effectiveness of the Vision Transformer (ViT) in FGIR. However, these investigations have frequently overlooked critical information embedded within class tokens across different layers, while also neglecting the subtle local details hidden within patch tokens. To address these issues and enhance FGIR performance, we introduce MIFBF, a novel ViT-based network architecture. The proposed model builds upon ViT by incorporating three modules: the Complementary Class Tokens Combination (CCTC) module, the Patches Information Integration (PII) module, and the Attention Cropping Module (ACM). The CCTC module integrates multi-layer class tokens to capture complementary information, thereby enhancing the model's representational capacity. The PII module exploits the rich local details encoded in patch tokens to improve classification accuracy. The ACM module generates regions of interest based on ViT's self-attention weights and effectively filters background noise, directing the model's attention to the most relevant image areas. Experiments conducted on three different datasets validate the effectiveness of the proposed model, which achieves state-of-the-art results and demonstrates its superiority in FGIR tasks.
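To make the multi-layer class-token fusion idea behind CCTC concrete, the sketch below collects the [CLS] token from several encoder layers of a small ViT-style model and concatenates them before classification. This is a minimal illustration under assumed settings (model size, chosen layers, and fusion head are all illustrative), not the authors' implementation, and it does not include the PII or ACM modules.

```python
# Illustrative sketch (not the paper's code): fuse class tokens taken from
# several transformer layers, in the spirit of the CCTC module described above.
import torch
import torch.nn as nn


class TinyViTWithClassTokenFusion(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=6,
                 heads=3, num_classes=200, fuse_layers=(3, 4, 5)):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # patch embedding via a strided convolution
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.fuse_layers = set(fuse_layers)   # layers whose [CLS] tokens are fused
        self.norm = nn.LayerNorm(dim)
        # classification head over the concatenated multi-layer class tokens
        self.head = nn.Linear(dim * len(fuse_layers), num_classes)

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)      # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)          # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed

        cls_tokens = []
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i in self.fuse_layers:
                cls_tokens.append(self.norm(x)[:, 0])           # [CLS] of layer i

        fused = torch.cat(cls_tokens, dim=-1)                   # complementary fusion
        return self.head(fused)


if __name__ == "__main__":
    model = TinyViTWithClassTokenFusion()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 200])
```

Concatenation is only one plausible way to combine the tokens; the paper may use a different fusion scheme, and the actual MIFBF model additionally mines patch tokens (PII) and crops attention-derived regions to suppress background (ACM).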
Keywords
Fine-grained image recognition, Vision Transformer, Multi-level information, Information fusion