m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers
CoRR (2024)
Abstract
Modular neural architectures are gaining increasing attention due to their
powerful capability for generalization and sample-efficient adaptation to new
domains. However, training modular models, particularly in the early stages,
poses challenges due to the optimization difficulties arising from their
intrinsic sparse connectivity. Leveraging the knowledge from monolithic models,
using techniques such as knowledge distillation, is likely to facilitate the
training of modular models and enable them to integrate knowledge from multiple
models pretrained on diverse sources. Nevertheless, conventional knowledge
distillation approaches are not tailored to modular models and can fail when
directly applied due to the unique architectures and the enormous number of
parameters involved. Motivated by these challenges, we propose a general
module-to-module knowledge distillation (m2mKD) method for transferring
knowledge between modules. Our approach involves teacher modules split from a
pretrained monolithic model, and student modules of a modular model. m2mKD
separately combines these modules with a shared meta model and encourages the
student module to mimic the behaviour of the teacher module. We evaluate the
effectiveness of m2mKD on two distinct modular neural architectures: Neural
Attentive Circuits (NACs) and Vision Mixture-of-Experts (V-MoE). By applying
m2mKD to NACs, we achieve significant improvements in IID accuracy on
Tiny-ImageNet (up to 5.6%). On average, we observe a 1% improvement. The
V-MoE-Base model trained using m2mKD also achieves 3.5% higher accuracy than
end-to-end training on ImageNet. The experimental results demonstrate that our
method offers a promising solution for connecting modular networks with
pretrained monolithic models. Code is available at
https://github.com/kamanphoebe/m2mKD.
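The distillation setup described above — plugging either a teacher module (split from a pretrained monolithic model) or a student module into a shared meta model, and training the student to mimic the teacher's behaviour — can be sketched roughly as follows. All function and variable names here are illustrative assumptions, not the authors' actual API; the meta model and modules are reduced to toy stand-ins, and the MSE objective is one plausible choice of mimicry loss.

```python
import numpy as np

def meta_model(module, x):
    """Toy stand-in for the shared meta model: a fixed encoder, the
    swappable module (teacher or student), and a pooling decoder."""
    h = np.tanh(x)          # shared meta encoder (stand-in)
    h = module(h)           # the plugged-in module
    return h.mean(axis=-1)  # shared meta decoder (stand-in)

def m2m_distill_loss(teacher_module, student_module, x):
    """Module-to-module distillation loss (illustrative): mean-squared
    error between the meta model's output with the teacher module
    plugged in and its output with the student module plugged in."""
    t = meta_model(teacher_module, x)  # teacher pathway (kept frozen)
    s = meta_model(student_module, x)  # student pathway (being trained)
    return float(np.mean((t - s) ** 2))

rng = np.random.default_rng(0)
W_t = rng.normal(size=(8, 8))                    # "teacher" weights
W_s = W_t + 0.01 * rng.normal(size=(8, 8))       # perturbed "student"
teacher = lambda h: h @ W_t
student = lambda h: h @ W_s

x = rng.normal(size=(4, 8))
loss = m2m_distill_loss(teacher, student, x)
```

In this sketch only the student pathway would receive gradient updates; the loss shrinks to zero as the student module's behaviour (as seen through the shared meta model) converges to the teacher's.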