Retraining-free Model Quantization via One-Shot Weight-Coupling Learning
CoRR (2024)
Abstract
Quantization is important for compressing over-parameterized deep
neural models and deploying them on resource-limited devices. Fixed-precision
quantization suffers a performance drop because of its limited numerical
representation ability. Conversely, mixed-precision quantization (MPQ) is
advocated to compress the model effectively by allocating heterogeneous
bit-widths to layers. MPQ is typically organized into a searching-retraining
two-stage process. Previous works focus only on efficiently determining the
optimal bit-width configuration in the first stage, while ignoring the
considerable time cost of the second stage. However, retraining always
consumes hundreds of GPU-hours on cutting-edge GPUs, thus hindering
deployment efficiency significantly. In this paper, we devise a one-shot
training-searching paradigm for mixed-precision model compression.
Specifically, in the first stage, all potential bit-width configurations are
coupled and thus optimized simultaneously within a set of shared weights.
However, our observations reveal a previously unseen and severe bit-width
interference phenomenon among highly coupled weights during optimization,
leading to considerable performance degradation under a high compression ratio.
To tackle this problem, we first design a bit-width scheduler to dynamically
freeze the most turbulent bit-width of layers during training, to ensure that the
remaining bit-widths converge properly. Then, taking inspiration from information
theory, we present an information distortion mitigation technique to align the
behaviour of the poorly performing bit-widths with that of the well-performing ones.
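
To make the idea of coupling several bit-widths within one set of shared weights more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a symmetric uniform quantizer with max-abs scaling; the layer names (`conv1`, `fc`), the candidate bit-widths `(2, 4, 8)`, the random weights, and the helper names `quantize` and `mse` are all hypothetical choices made for illustration. It quantizes the same weights at each candidate bit-width, measures the resulting error, and computes the weight storage cost of one mixed-precision configuration.

```python
import random


def quantize(weights, bits):
    """Symmetric uniform fake-quantization of a list of floats to `bits` bits.

    Assumption: max-abs scaling; the abstract does not specify the quantizer.
    """
    levels = 2 ** (bits - 1) - 1                    # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = max_abs / levels
    # Round to the nearest integer level, clamp, then map back to floats.
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)

# One set of shared full-precision weights per (hypothetical) layer.
shared = {
    "conv1": [random.gauss(0, 1) for _ in range(256)],
    "fc": [random.gauss(0, 1) for _ in range(1024)],
}
candidates = (2, 4, 8)  # hypothetical candidate bit-widths

# Every candidate bit-width is derived from the same shared weights.
errors = {
    name: {b: mse(w, quantize(w, b)) for b in candidates}
    for name, w in shared.items()
}

# One mixed-precision configuration: a bit-width chosen per layer.
config = {"conv1": 8, "fc": 4}
size_bits = sum(len(shared[name]) * bits for name, bits in config.items())

for name, per_bits in errors.items():
    print(name, {b: round(e, 6) for b, e in per_bits.items()})
print("weight storage (bits):", size_bits)
```

In this toy setting the lowest bit-width incurs the largest quantization error on the shared weights. That only illustrates why low bit-widths are the hardest to fit; it does not reproduce the interference phenomenon, the bit-width scheduler, or the information distortion mitigation technique described in the abstract.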