Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
CoRR (2024)
Abstract
Kullback-Leibler divergence has been widely used in Knowledge Distillation
(KD) to compress Large Language Models (LLMs). Contrary to prior assertions
that reverse Kullback-Leibler (RKL) divergence is mode-seeking and thus
preferable over the mean-seeking forward Kullback-Leibler (FKL) divergence,
this study empirically and theoretically demonstrates that neither mode-seeking
nor mean-seeking properties manifest in KD for LLMs. Instead, RKL and FKL are
found to share the same optimization objective and both converge after a
sufficient number of epochs. In practice, however, LLMs are seldom trained
for that many epochs. We further find that, in the early epochs, RKL focuses
on the tail of the distribution while FKL focuses on the head. Consequently,
we propose a simple yet effective Adaptive Kullback-Leibler (AKL) divergence,
which adaptively allocates weights to combine FKL and RKL. Metric-based and GPT-4-based
evaluations demonstrate that the proposed AKL outperforms the baselines across
various tasks and improves the diversity and quality of generated responses.
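To make the quantities in the abstract concrete, the sketch below computes the forward and reverse KL divergences between a teacher distribution `p` and a student distribution `q`, and mixes them with a head-mass-based weight. This is an illustrative sketch only: the weighting rule and the `threshold` parameter are assumptions for demonstration, not the paper's actual AKL formulation.

```python
import numpy as np

def fkl(p, q, eps=1e-12):
    """Forward KL: sum_i p_i * log(p_i / q_i), teacher p, student q."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def rkl(p, q, eps=1e-12):
    """Reverse KL: sum_i q_i * log(q_i / p_i)."""
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def akl(p, q, threshold=0.5):
    """Hypothetical adaptive combination of FKL and RKL.

    Weights FKL by the teacher's head mass (tokens covering up to
    `threshold` of the probability) and RKL by the remaining tail
    mass. Illustrative only; the paper's weighting differs in detail.
    """
    order = np.argsort(p)[::-1]            # sort tokens by teacher prob, descending
    cum = np.cumsum(p[order])              # cumulative teacher mass
    n_head = int(np.sum(cum <= threshold)) # tokens forming the "head"
    w = float(p[order][:n_head].sum())     # head mass -> weight on FKL
    return w * fkl(p, q) + (1.0 - w) * rkl(p, q)
```

Since `akl` is a convex combination, its value always lies between the FKL and RKL values for the same pair of distributions.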