AiluRus: A Scalable ViT Framework for Dense Prediction
NeurIPS (2023)
Abstract
Vision transformers (ViTs) have emerged as a prevalent architecture for
vision tasks owing to their impressive performance. However, the computational
cost of ViTs grows significantly with the length of the token sequence, which is
a particular problem in dense prediction tasks that require high-resolution input.
Notably, dense prediction tasks, such as semantic segmentation or object
detection, place more emphasis on the contours or shapes of objects, while the
texture inside objects is less informative. Motivated by this observation, we
propose to apply adaptive resolution for different regions in the image
according to their importance. Specifically, at the intermediate layer of the
ViT, we utilize a spatial-aware density-based clustering algorithm to select
representative tokens from the token sequence. Once the representative tokens
are determined, we proceed to merge other tokens into their closest
representative token. Consequently, semantically similar tokens are merged together
to form low-resolution regions, while semantically irrelevant tokens are preserved
independently as high-resolution regions. This strategy effectively reduces the
number of tokens, allowing subsequent layers to handle a reduced token sequence
and achieve acceleration. We evaluate our proposed method on three different
datasets and observe promising performance. For example, the "Segmenter ViT-L"
model can run at 48% higher FPS without fine-tuning while maintaining its
performance. Additionally, our method can be applied to accelerate fine-tuning
as well. Experimental results demonstrate that we can reduce training time by 52%
and reach 2.46 times the FPS with only a 0.09% performance drop. The code
is available at https://github.com/caddyless/ailurus/tree/main.
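
The select-then-merge step described in the abstract can be illustrated with a
small NumPy sketch. This is not the authors' implementation (see the repository
above for that). It assumes a density-peaks style score (local density times
distance to the nearest denser token), a distance that adds a weighted grid
distance to the feature distance as a stand-in for "spatial-aware", and plain
averaging as the merge rule. The function name select_and_merge and the
parameters num_reps, k and spatial_weight are hypothetical.

```python
import numpy as np

def select_and_merge(tokens, grid_hw, num_reps, k=4, spatial_weight=0.5):
    """Pick representative tokens and merge the rest into them.

    tokens:  (N, D) token features from an intermediate layer.
    grid_hw: (H, W) patch grid with H * W == N.
    Returns (merged, assign, reps): merged is (num_reps, D), assign maps
    each of the N tokens to a cluster index, reps holds the token indices
    chosen as representatives.
    """
    h, w = grid_hw
    n = tokens.shape[0]
    assert h * w == n and 0 < num_reps <= n and 0 < k < n

    # Pairwise feature distances, scaled to [0, 1].
    feat = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    feat = feat / max(feat.max(), 1e-12)

    # Pairwise distances on the patch grid, scaled to [0, 1].
    ys, xs = np.divmod(np.arange(n), w)
    pos = np.stack([ys, xs], axis=1).astype(float)
    spat = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    spat = spat / max(spat.max(), 1e-12)

    # Assumption: spatial awareness is modelled as an additive penalty.
    dist = feat + spatial_weight * spat

    # Local density from the k nearest neighbours (column 0 is the token itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))

    # Distance from each token to the nearest token with higher density.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()  # tokens with no denser neighbour

    # Tokens that are both dense and far from denser tokens become representatives.
    reps = np.argsort(-(density * delta))[:num_reps]

    # Assign every token to its closest representative.
    assign = dist[:, reps].argmin(axis=1)
    assign[reps] = np.arange(num_reps)  # a representative stays in its own cluster

    # Merge by averaging the features in each cluster.
    merged = np.zeros((num_reps, tokens.shape[1]))
    np.add.at(merged, assign, tokens)
    counts = np.bincount(assign, minlength=num_reps)
    merged /= counts[:, None]
    return merged, assign, reps
```

In this sketch, layers after the merge would operate on the num_reps merged
tokens instead of all N, and assign can be used to copy each merged token back
to its original grid positions when a full-resolution output is needed.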
Key words
scalable ViT framework, prediction