EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge
CoRR (2024)
Abstract
Despite the remarkable strides of Large Language Models (LLMs) in various fields, the wide application of LLMs on edge devices is limited by their massive numbers of parameters and computations. To address this, quantization is commonly adopted to generate lightweight LLMs with efficient computation and fast inference. However, Post-Training Quantization (PTQ) methods degrade dramatically in quality when weights, activations, and the KV cache are quantized together to below 8 bits. Besides, many Quantization-Aware Training (QAT) works quantize only the model weights, leaving the activations untouched, which does not fully exploit the potential of quantization for inference acceleration on the edge. In this
paper, we propose EdgeQAT, the Entropy and Distribution Guided QAT for the
optimization of lightweight LLMs to achieve inference acceleration on Edge
devices. We first identify that the performance drop from quantization primarily stems from information distortion in the quantized attention maps, as demonstrated by the differing distributions of the quantized queries and keys in the self-attention mechanism. We then propose the entropy and distribution guided QAT to mitigate this information distortion. Moreover, we design a token importance-aware adaptive method that dynamically quantizes tokens with different bit widths for further optimization and acceleration. Extensive experiments verify the substantial improvements achieved by our framework across various datasets. Furthermore, we achieve an on-device speedup of up to 2.37x compared with the FP16 counterparts across multiple edge devices, signaling a groundbreaking advancement.
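
To make the quantization setting concrete, the following is a minimal sketch (not the authors' released implementation) of a symmetric fake-quantizer with a straight-through estimator, plus an entropy measure on the attention map built from quantized queries and keys. The function names, the 4-bit width, and the entropy-gap comparison are illustrative assumptions based on the abstract's description.

    import torch
    import torch.nn.functional as F

    def fake_quantize(x, n_bits=4):
        # Symmetric per-tensor fake quantization with a straight-through estimator.
        qmax = 2 ** (n_bits - 1) - 1
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        return x + (q - x).detach()  # forward uses q, backward sees the identity

    def attention_entropy(q, k):
        # Mean entropy of the softmax attention map from (possibly quantized) q and k.
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        probs = F.softmax(scores, dim=-1)
        return -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1).mean()

    # Toy usage: measure how much quantizing queries and keys perturbs the
    # attention-map entropy relative to the full-precision baseline.
    q_fp = torch.randn(2, 8, 16, 64)   # (batch, heads, tokens, head_dim)
    k_fp = torch.randn(2, 8, 16, 64)
    q_q, k_q = fake_quantize(q_fp), fake_quantize(k_fp)
    entropy_gap = (attention_entropy(q_fp, k_fp) - attention_entropy(q_q, k_q)).abs()

Penalizing such an entropy gap during QAT would be one way to realize the entropy guidance the abstract describes; the paper's exact objective may differ.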