
DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing

Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He

CoRR (2024)

Abstract
Multi-scale deformable attention (MSDeformAttn) has emerged as a key mechanism in various vision tasks, demonstrating explicit superiority attributed to multi-scale grid-sampling. However, this newly introduced operator incurs irregular data access and an enormous memory requirement, leading to severe PE underutilization. Meanwhile, existing approaches for attention acceleration cannot be directly applied to MSDeformAttn due to the lack of support for this distinct procedure. Therefore, we propose a dedicated algorithm-architecture co-design dubbed DEFA, the first-of-its-kind method for MSDeformAttn acceleration. At the algorithm level, DEFA adopts frequency-weighted pruning and probability-aware pruning for feature maps and sampling points respectively, alleviating the memory footprint by over 80%. At the architecture level, it exploits multi-scale parallelism to boost throughput significantly and further reduces memory access via fine-grained layer fusion and feature map reuse. Extensively evaluated on representative benchmarks, DEFA achieves 10.1-31.9x speedup and 20.3-37.7x energy efficiency improvement over powerful GPUs. It also surpasses related accelerators by 2.2-3.7x in energy efficiency while providing pioneering support for MSDeformAttn.
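
For context on the operator being accelerated, the following is a minimal PyTorch sketch of the multi-scale grid-sampling core of MSDeformAttn as commonly formulated (e.g., in Deformable DETR); it is not DEFA's implementation, and the tensor names (value_levels, sampling_locations, attention_weights) and shapes are assumptions for illustration. The per-query, data-dependent grid_sample gather is the source of the irregular memory access the paper targets.

import torch
import torch.nn.functional as F

def ms_deform_attn_core(value_levels, sampling_locations, attention_weights):
    # value_levels: list of L tensors, each (N, C, H_l, W_l) -- multi-scale feature maps
    # sampling_locations: (N, Q, L, K, 2), normalized to [0, 1]
    # attention_weights:  (N, Q, L, K), softmax-normalized over L*K per query
    # returns: (N, Q, C) aggregated features
    N, Q, L, K, _ = sampling_locations.shape
    sampled = []
    for lvl, value in enumerate(value_levels):
        # grid_sample expects coordinates in [-1, 1]; this irregular,
        # data-dependent gather is what causes the memory-access problem.
        grid = 2.0 * sampling_locations[:, :, lvl] - 1.0          # (N, Q, K, 2)
        out = F.grid_sample(value, grid, mode='bilinear',
                            padding_mode='zeros', align_corners=False)  # (N, C, Q, K)
        sampled.append(out)
    sampled = torch.stack(sampled, dim=3)                         # (N, C, Q, L, K)
    weights = attention_weights.unsqueeze(1)                      # (N, 1, Q, L, K)
    return (sampled * weights).sum(dim=(3, 4)).permute(0, 2, 1)   # (N, Q, C)

In this formulation, pruning feature-map regions and sampling points (as DEFA does at the algorithm level) directly shrinks the value_levels footprint and the number of grid_sample gathers per query.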