One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
arXiv (2024)
Abstract
Large Language Models (LLMs) have advanced rapidly but face significant
memory demands. While quantization has shown promise for LLMs, current methods
typically require lengthy training to alleviate the performance degradation
caused by quantization. However, deploying LLMs across diverse scenarios with
different resource constraints, e.g., servers and personal computers, requires
repeated training for each application, which amplifies the lengthy training
problem. Given this, it is advantageous to train a once-for-all (OFA) supernet
capable of yielding diverse optimal subnets for downstream applications through
one-shot training. Nonetheless, the scale of current language models impedes
efficiency and amplifies interference from weight sharing between subnets. We
make an initial attempt to extend the once-for-all framework to large language
models. Specifically, we decouple the shared weights to eliminate interference
between subnets and incorporate low-rank (LoRA) adapters for training
efficiency. Furthermore, we observe that traditional uniform sampling allocates
training resources unevenly across quantization configurations. We therefore
introduce a non-parametric scheduler that adjusts the sampling rate of each
configuration, achieving a more balanced allocation among subnets with varying
demands. We validate the approach on the LLaMA2 family, and downstream
evaluation confirms that it maintains high performance while significantly
reducing deployment time when facing multiple deployment scenarios.
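
To make the sampling idea concrete, below is a minimal sketch of how a non-parametric scheduler over quantization configurations might work, assuming the sampling rate of each configuration is adjusted in proportion to its recent training loss so that lagging subnets receive more steps. The class name, the EMA-based heuristic, and the bit-width list are illustrative assumptions, not the paper's exact algorithm.

```python
import random

class QuantConfigScheduler:
    """Illustrative non-parametric scheduler: rebalances how often each
    quantization configuration (subnet) is sampled during supernet
    training. Configs whose recent loss lags behind are sampled more.
    (Assumed heuristic; the paper's exact update rule may differ.)"""

    def __init__(self, configs, ema_decay=0.9):
        self.configs = list(configs)              # e.g. bit-widths [2, 3, 4, 8]
        self.ema_loss = {c: None for c in configs}
        self.ema_decay = ema_decay

    def update(self, config, loss):
        # Track an exponential moving average of each config's loss.
        prev = self.ema_loss[config]
        self.ema_loss[config] = loss if prev is None else (
            self.ema_decay * prev + (1 - self.ema_decay) * loss)

    def sample(self):
        # Fall back to uniform sampling until every config has been seen.
        if any(v is None for v in self.ema_loss.values()):
            return random.choice(self.configs)
        # Sample proportionally to the EMA loss: under-trained
        # (higher-loss) configs get a larger share of training steps.
        total = sum(self.ema_loss.values())
        weights = [self.ema_loss[c] / total for c in self.configs]
        return random.choices(self.configs, weights=weights, k=1)[0]

# Usage inside a (hypothetical) supernet training loop:
scheduler = QuantConfigScheduler(configs=[2, 3, 4, 8])
for step in range(1000):
    bits = scheduler.sample()             # pick a quantization config
    loss = 1.0 / bits + random.random()   # stand-in for the real subnet loss
    scheduler.update(bits, loss)
```

Compared with uniform sampling, this kind of loss-aware scheme concentrates training on the configurations that need it most, which is the imbalance the abstract describes.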