TransFRU: Efficient Deployment of Transformers on FPGA with Full Resource Utilization

ASPDAC '24: Proceedings of the 29th Asia and South Pacific Design Automation Conference (2024)

Abstract
Transformer-based models have achieved huge success in various artificial intelligence (AI) tasks, e.g., natural language processing (NLP) and computer vision (CV). However, transformer-based models suffer from high computation density, which makes them hard to deploy on resource-constrained devices such as field-programmable gate arrays (FPGAs). Within the overall transformer pipeline, self-attention contributes most of the computation load and becomes the bottleneck of transformer-based models. In this paper, we propose TransFRU, a novel FPGA-based accelerator for the self-attention mechanism with full utilization of hardware resources. Specifically, we first leverage 4-bit and 8-bit processing elements (PEs) to pack multiple signed multiplications into one DSP block. Second, we skip the zero and near-zero values in the intermediate results of self-attention with a sorting engine. The sorting engine is also responsible for operand sharing, which boosts the computation efficiency of each DSP block. Experimental results show that TransFRU achieves 7.86-49.16x speedup and 151.1x better energy efficiency than a CPU, and 1.41x speedup and 5.9x better energy efficiency than a GPU. Furthermore, we observe 1.91-13.56x higher throughput per DSP block and 3.53-9.62x better energy efficiency compared with previous FPGA accelerators.
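
The two ideas named in the abstract can be sketched in software. The following minimal Python sketch is an illustration only, not the TransFRU design: it shows (1) how two signed low-bit multiplications that share an operand can be packed into one wide multiplication, mirroring how two products can be served by a single DSP block, and (2) how zero and near-zero attention scores can be skipped after sorting. The 18-bit field offset, the 8-bit operand range, and the keep_top_k pruning rule are assumptions chosen for the example; the paper's actual PE bit widths, DSP port arrangement, and pruning criterion are not stated in the abstract.

def pack_two_muls(a, b, c, shift=18):
    """Compute a*c and b*c with a single wide multiplication.

    a, b, c are assumed to be signed 8-bit values; 'shift' separates the two
    product fields, mimicking the wide multiplier port of one DSP block.
    (Illustrative sketch only, not the TransFRU PE implementation.)
    """
    packed = (a << shift) + b            # a occupies the upper field, b the lower field
    product = packed * c                 # one wide multiplication with the shared operand c
    low = product & ((1 << shift) - 1)   # lower field holds b*c as an unsigned bit pattern
    if low >= 1 << (shift - 1):          # re-interpret the lower field as signed
        low -= 1 << shift
    high = product >> shift              # upper field holds a*c ...
    if low < 0:
        high += 1                        # ... plus a borrow correction when b*c is negative
    return high, low

def skip_near_zero(scores, keep_top_k):
    """Keep only the largest-magnitude attention scores and zero out the rest,
    a software stand-in for skipping zero/near-zero intermediate values."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    keep = set(order[:keep_top_k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]

# Quick checks of the packing identity for signed 8-bit operands.
for a, b, c in [(3, -2, 5), (-3, 2, -5), (127, -128, -128), (-100, 99, 7)]:
    assert pack_two_muls(a, b, c) == (a * c, b * c)

The correction step works because packed * c equals a*c * 2^shift + b*c exactly; when b*c is negative it borrows one unit from the upper field, so the upper product must be incremented to recover a*c.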
Keywords
Transformer, Self-attention, FPGA, DSP