Hardware-Software Co-Design Enabling Static and Dynamic Sparse Attention Mechanisms

Jieru Zhao, Pai Zeng, Guan Shen, Quan Chen, Minyi Guo

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2024)

Abstract
The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention incurs heavy computational and memory burdens. Sparse attention techniques, including both static and dynamic sparsity, reduce this quadratic complexity by computing attention only on selected queries and keys. Static and dynamic methods exhibit a trade-off between efficiency and adaptability, making them suited to different scenarios. However, existing accelerators either target specific domains or suffer performance degradation on long sequences, and none of them supports static and dynamic sparse attention mechanisms simultaneously. To this end, we propose SALO2, a hardware-software co-design framework that enables efficient static and dynamic sparse attention computations and can be applied to various scenarios, tasks, and inputs. Experiments show that SALO2 achieves 104.80x, 13.65x, and 1.38x speedups over an Intel Xeon CPU, an NVIDIA RTX4090 GPU, and SALO (the SOTA accelerator exploiting static sparsity) on tasks with long input sequences, and achieves 76.17x, 8.98x, and 1.71x speedups over an Intel Xeon CPU, an NVIDIA RTX4090 GPU, and Sanger (the SOTA accelerator exploiting dynamic sparsity) on tasks with shorter sequences. The source code is available at https://github.com/sjtu-zhao-lab/SALO.
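To make the static/dynamic distinction concrete, below is a minimal PyTorch sketch of masked sparse attention. This is not the paper's implementation: the sliding-window-plus-global pattern (static) and the threshold-based pattern (dynamic) are illustrative assumptions, chosen only to show that a static mask is fixed before inference while a dynamic mask is derived from the inputs themselves.

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, mask):
    # q, k, v: (seq_len, d); mask: (seq_len, seq_len) boolean,
    # True where the score is computed, False where it is skipped.
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

seq_len, d, window = 16, 8, 4
q, k, v = (torch.randn(seq_len, d) for _ in range(3))

# Static sparsity (assumed pattern): fixed before inference,
# e.g. a sliding window plus global attention to the first token.
idx = torch.arange(seq_len)
static_mask = (idx[:, None] - idx[None, :]).abs() < window
static_mask[:, 0] = True  # global token

# Dynamic sparsity (assumed pattern): input-dependent, e.g. keep
# only score entries above each row's mean estimated score.
with torch.no_grad():
    approx = (q @ k.T) / (d ** 0.5)
    dynamic_mask = approx > approx.mean(dim=-1, keepdim=True)

out_static = sparse_attention(q, k, v, static_mask)
out_dynamic = sparse_attention(q, k, v, dynamic_mask)
```

The static mask costs nothing at run time but cannot adapt to the input, whereas the dynamic mask adapts per input at the cost of estimating the scores first; this is the efficiency/adaptability trade-off the abstract refers to.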
Keywords
Attention acceleration, static/dynamic sparsity