LocoMixer: A Local Context MLP-Like Architecture For Image Classification

2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI 2023)

Abstract
While Convolutional Neural Networks (CNNs) have been the de facto architecture for computer vision tasks, recent works have shown that self-attention-based models, such as the Vision Transformer (ViT), achieve results competitive with CNNs on various visual benchmarks. More recently, MLP-Mixer replaced the self-attention layer with a spatial MLP block as the token-mixing module. It attains good scores on image classification tasks when trained on large datasets with strong regularization. In this paper, we propose a Local Context MLP-like architecture (LocoMixer). Unlike MLP-Mixer, where the token-mixing MLP captures global spatial information for information exchange, we primarily focus on the local dependencies between the patch embeddings. Our LocoMixer independently encodes the spatial features along the rows and columns of 2D image patches with fully-connected layers. Meanwhile, we apply a scale-transformation operation to learn multi-scale feature representations in our proposed multiscale token-MLP. Furthermore, we model local cross-channel interaction by absorbing a context block into the proposed dynamic channel-MLP. When trained only on the ImageNet-1K dataset, the proposed LocoMixer achieves state-of-the-art performance compared with other MLP-like models. Moreover, the proposed LocoMixer achieves 82.3% top-1 accuracy with only 26M parameters, which is much better than most CNNs and vision Transformers under the same model size constraint. When scaled up to 90M parameters, LocoMixer achieves 84.2% top-1 accuracy, which outperforms the state-of-the-art Deformable Attention Transformer (DAT).
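The row-and-column encoding described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not say how the two branches are combined, so the summation and the residual connection below are assumptions, and the function names (`make_weights`, `mix_rows`, `mix_cols`, `axial_token_mix`) are hypothetical. The multiscale token-MLP and the dynamic channel-MLP are not covered.

```python
import random


def make_weights(n, seed):
    """Hypothetical helper: an n x n weight matrix with small random entries."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n)]


def mix_rows(x, w_row):
    """Fully-connected mixing along each row (the width axis).

    x has shape [H][W][C]; w_row has shape [W][W].
    out[h][j][c] = sum_k w_row[j][k] * x[h][k][c]
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_row[j][k] * x[h][k][c] for k in range(W))
              for c in range(C)]
             for j in range(W)]
            for h in range(H)]


def mix_cols(x, w_col):
    """Fully-connected mixing along each column (the height axis).

    x has shape [H][W][C]; w_col has shape [H][H].
    out[i][w][c] = sum_k w_col[i][k] * x[k][w][c]
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_col[i][k] * x[k][w][c] for k in range(H))
              for c in range(C)]
             for w in range(W)]
            for i in range(H)]


def axial_token_mix(x, w_row, w_col):
    """Encode rows and columns independently, then combine.

    Assumption: the branches are summed and added to the input as a
    residual; the abstract does not specify the combination rule.
    """
    rows = mix_rows(x, w_row)
    cols = mix_cols(x, w_col)
    H, W, C = len(x), len(x[0]), len(x[0][0])
    return [[[x[h][w][c] + rows[h][w][c] + cols[h][w][c]
              for c in range(C)]
             for w in range(W)]
            for h in range(H)]
```

In this sketch, a patch at position (h, w) only receives information from patches in row h and column w, in contrast to a global token-mixing MLP, where every patch is connected to every other patch in a single layer.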
Key words
Image Classification, MLP, Deep Neural Network