CIMFormer: A Systolic CIM-Array-Based Transformer Accelerator With Token-Pruning-Aware Attention Reformulating and Principal Possibility Gathering

IEEE Journal of Solid-State Circuits (2024)

Abstract
Transformer models have achieved impressive performance in various artificial intelligence (AI) applications. However, their high computation cost and memory footprint make inference inefficient. Although digital compute-in-memory (CIM) is a promising hardware architecture with high accuracy, the Transformer's attention mechanism raises three challenges in the access and computation of CIM: 1) the attention computation involving Query and Key results in massive data movement and under-utilization in CIM macros; 2) the attention computation involving Possibility and Value exhibits plenty of dynamic bit-level sparsity, resulting in redundant bit-serial CIM operations; and 3) the restricted data-reload bandwidth of CIM macros significantly degrades performance for large Transformer models. To address these challenges, we design a CIM accelerator called CIM Transformer (CIMFormer) with three corresponding features. First, token-pruning-aware attention reformulation (TPAR) adjusts attention computations according to the token-pruning ratio, reducing real-time access to, and under-utilization of, CIM macros. Second, the principal possibility gather-scatter scheduler (PPGSS) gathers the possibilities with greater effective bit-width as concurrent inputs to CIM macros, enhancing the efficiency of bit-serial CIM operations. Third, the systolic X $\mid$ W-CIM macro array efficiently executes large Transformer models that exceed the storage capacity of the on-chip CIM macros. Fabricated in a 28-nm technology, CIMFormer achieves a peak energy efficiency of 15.71 TOPS/W, an over 1.46 $\times$ improvement over the state-of-the-art Transformer accelerator under equivalent conditions.
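To illustrate the scheduling idea behind PPGSS, the following is a minimal NumPy sketch, not the paper's implementation: it assumes 8-bit quantized attention possibilities and a bit-serial CIM macro whose input group costs as many cycles as its widest (largest effective bit-width) member. All names (`effective_bitwidth`, `schedule_gathered`, the group size of 16) are illustrative assumptions; the sketch only shows why gathering values of similar effective bit-width into the same group reduces bit-serial input cycles, and it omits the scatter step that would restore the original ordering of partial sums.

```python
import numpy as np

def effective_bitwidth(vals_q):
    """Effective bit-width of quantized magnitudes: highest set bit index + 1 (0 for zero)."""
    return np.where(vals_q == 0, 0,
                    np.floor(np.log2(np.maximum(vals_q, 1))).astype(int) + 1)

def bitserial_cycles(groups_ebw):
    """A bit-serial input group costs as many cycles as its widest member (assumption)."""
    return sum(int(g.max()) for g in groups_ebw if len(g))

def schedule_naive(ebw, group_size):
    """Group possibilities in their original order."""
    return [ebw[i:i + group_size] for i in range(0, len(ebw), group_size)]

def schedule_gathered(ebw, group_size):
    """Gather possibilities of similar effective bit-width into the same group (sketch of the gather step)."""
    sorted_ebw = ebw[np.argsort(ebw)[::-1]]   # widest possibilities first
    return [sorted_ebw[i:i + group_size] for i in range(0, len(sorted_ebw), group_size)]

# Toy attention-possibility row: a softmax output is spiky, so most entries are tiny.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(256) * 0.1)
probs_q = np.round(probs * 255).astype(int)   # 8-bit quantization (assumption)
ebw = effective_bitwidth(probs_q)

naive = bitserial_cycles(schedule_naive(ebw, group_size=16))
gathered = bitserial_cycles(schedule_gathered(ebw, group_size=16))
print(f"bit-serial input cycles -- naive: {naive}, gathered: {gathered}")
```

Under these assumptions, the gathered schedule spends long bit-serial runs only on the few high-magnitude possibilities and dispatches the many near-zero ones in short groups, which is the intuition the abstract gives for PPGSS improving bit-serial CIM efficiency.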
Keywords
Attention, bit sparsity, compute-in-memory (CIM), dataflow, systolic, token pruning, Transformer