A Comprehensive Evaluation of Quantization Strategies for Large Language Models
CoRR (2024)
Abstract
Increasing the number of parameters in large language models (LLMs) usually
improves performance in downstream tasks but raises compute and memory costs,
making deployment difficult in resource-limited settings. Quantization
techniques, which reduce the number of bits needed to represent model
weights or activations with minimal performance loss, have gained
popularity with the rise of LLMs. However,
most quantization studies are conducted on pre-trained LLMs; the impact of
quantization on instruction-tuned LLMs and the relationship between
perplexity and benchmark performance of quantized LLMs remain poorly
understood. Moreover, evaluation of quantized LLMs is often limited to
language modeling and a few classification tasks,
leaving their performance on other benchmarks unclear. To address these gaps,
we propose a structured evaluation framework consisting of three critical
dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and
conduct extensive experiments across ten diverse benchmarks. Our experimental
results indicate that LLMs with 4-bit quantization can retain performance
comparable to their non-quantized counterparts, and perplexity can serve as a
proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs
with larger parameter scales can outperform smaller LLMs. Despite the
memory savings it delivers, quantization can also slow down the inference
speed of LLMs. Consequently, substantial engineering efforts and hardware
support are imperative to achieve a balanced optimization of decoding speed and
memory consumption in the context of quantized LLMs.
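
As background for the quantization techniques the abstract refers to, the
sketch below shows symmetric 4-bit round-to-nearest weight quantization in
PyTorch. It is a minimal illustration of the general idea, not the specific
methods evaluated in the paper; the tensor shape and function names are
hypothetical.

```python
import torch

def quantize_rtn_4bit(w: torch.Tensor):
    # Symmetric per-row scales: map the largest |weight| in each row
    # onto the top of the signed 4-bit range [-8, 7].
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), min=-8, max=7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover a low-precision approximation of the original weights.
    return q.to(torch.float32) * scale

# Hypothetical weight matrix; real LLM layers are far larger.
w = torch.randn(256, 256)
q, scale = quantize_rtn_4bit(w)
w_hat = dequantize(q, scale)
print("mean relative error:", ((w - w_hat).abs().mean() / w.abs().mean()).item())
```

Practical 4-bit methods such as GPTQ or AWQ typically build on this
round-to-nearest baseline with calibration data and error compensation,
which is how quantized models can stay close to their full-precision
counterparts.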
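The abstract's finding that perplexity can serve as a proxy metric suggests
a simple comparison: measure perplexity on the same text under a
full-precision and a 4-bit checkpoint. The sketch below uses the Hugging
Face transformers API with bitsandbytes 4-bit loading; the model ID and
sample text are placeholders, not taken from the paper.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def perplexity(model, tokenizer, text: str, max_len: int = 1024) -> float:
    # Teacher-forced perplexity: exp of the mean next-token cross-entropy.
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=max_len).to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

model_id = "facebook/opt-1.3b"  # placeholder; the paper evaluates other LLMs
tok = AutoTokenizer.from_pretrained(model_id)
full = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")
quant = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True))

sample = "Quantization reduces the memory footprint of large language models."
print("fp16 ppl:", perplexity(full, tok, sample))
print("4-bit ppl:", perplexity(quant, tok, sample))
```

If the two perplexities track each other across texts, that supports using
perplexity as a cheap stand-in for full benchmark runs, which is the
relationship the paper investigates.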