
Mixture lightweight transformer for scene understanding

Computers & Electrical Engineering (2023)

Abstract
In adapting Transformers from language to computer vision, the main obstacles are the high computational complexity and large model size of Transformer blocks, which stem from the large number of visual tokens and the high resolution of input images. To address these challenges, this paper presents a mixture lightweight Transformer (MLT) backbone for image understanding, in which each Transformer block, called SH-Transformer, adopts Single-Head Self-Attention (SHSA) and a Convolutional Inception Module (CIM). Unlike previous Transformers that compute Multi-Head Self-Attention (MHSA), SHSA restricts the representation of input tokens to a single head, producing a low-dimensional embedding that greatly reduces computational complexity. While adding a small number of model parameters, SHSA greatly reduces the number of input tokens. As a complement to SHSA, which captures only global interactions, CIM explores multi-scale local information using lightweight convolutions along multiple parallel paths. Experimental results show that MLT achieves competitive or state-of-the-art results compared with recent Transformers while maintaining a smaller model size and lower computational cost across visual tasks, including image classification, semantic segmentation, and object detection. In particular, the proposed method improves top-1 accuracy on ImageNet-1K image classification by 4.2% over the tiny version of the Pyramid Vision Transformer.
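The abstract does not give implementation details of SHSA; as a rough illustration under standard assumptions, single-head scaled dot-product attention (one projection per query/key/value instead of multiple heads) can be sketched as below. All sizes and weight matrices are hypothetical, not taken from the paper:

```python
import numpy as np

def single_head_self_attention(x, wq, wk, wv):
    """Scaled dot-product attention with a single head.

    x: (n_tokens, d_model) input token embeddings.
    wq, wk, wv: (d_model, d_head) projection matrices.
    """
    q, k, v = x @ wq, x @ wk, x @ wv              # each (n_tokens, d_head)
    scores = (q @ k.T) / np.sqrt(q.shape[-1])     # (n_tokens, n_tokens) pairwise interactions
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (n_tokens, d_head)

# Hypothetical sizes for demonstration only.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 16, 64, 32
x = rng.standard_normal((n_tokens, d_model))
wq, wk, wv = (rng.standard_normal((d_model, d_head)) * 0.1 for _ in range(3))
out = single_head_self_attention(x, wq, wk, wv)
```

With a single head, the attention map is computed once at dimension `d_head` rather than once per head, which is the source of the parameter and compute savings the abstract attributes to SHSA; the quadratic cost in the number of tokens remains, which is why the method also reduces the token count.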
Key words
Transformers, Lightweight backbone, Multi-scale pyramid pooling, Convolutional inception