Estimating and Maximizing Mutual Information for Knowledge Distillation

arXiv (2023)

Citations: 3 | Views: 11
Abstract
In this work, we propose Mutual Information Maximization Knowledge Distillation (MIMKD). Our method uses a contrastive objective to simultaneously estimate and maximize a lower bound on the mutual information between local and global feature representations of a teacher and a student network. We demonstrate through extensive experiments that this can be used to improve the performance of low-capacity models by transferring knowledge from more performant but computationally expensive models, producing better models that can run on devices with limited computational resources. Our method is flexible: we can distill knowledge from teachers with arbitrary network architectures to arbitrary student networks. Our empirical results show that MIMKD outperforms competing approaches across a wide range of student-teacher pairs with different capacities, with different architectures, and when student networks have extremely low capacity. We obtain 74.55% accuracy on CIFAR100 with a ShuffleNetV2 student, up from a baseline accuracy of 69.8%, by distilling knowledge from a ResNet-50 teacher. On ImageNet we improve a ResNet-18 network from 68.88% to 70.32% accuracy (+1.44%) using a ResNet-34 teacher network.
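The contrastive lower bound on mutual information described in the abstract is commonly realized with an InfoNCE-style objective, where matching (student, teacher) feature pairs from the same image are positives and all other pairs in the batch are negatives. Below is a minimal PyTorch sketch of such a bound; it is an illustrative assumption rather than the authors' released implementation, and the class name, projection dimension, and temperature are hypothetical choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCEBound(nn.Module):
    """InfoNCE-style contrastive lower bound on I(student; teacher) features (sketch)."""

    def __init__(self, student_dim, teacher_dim, embed_dim=128, temperature=0.07):
        super().__init__()
        # Project both feature spaces into a shared embedding space.
        self.proj_s = nn.Linear(student_dim, embed_dim)
        self.proj_t = nn.Linear(teacher_dim, embed_dim)
        self.temperature = temperature

    def forward(self, f_s, f_t):
        # f_s: (B, student_dim) student features; f_t: (B, teacher_dim) teacher features.
        z_s = F.normalize(self.proj_s(f_s), dim=1)
        z_t = F.normalize(self.proj_t(f_t), dim=1)
        # Similarity of every student embedding with every teacher embedding in the batch.
        logits = z_s @ z_t.t() / self.temperature  # (B, B)
        labels = torch.arange(z_s.size(0), device=z_s.device)
        # Diagonal entries are positive pairs; minimizing this cross-entropy
        # maximizes a lower bound on the mutual information between the two views.
        return F.cross_entropy(logits, labels)

In training, such a bound would typically be added to the usual task loss, e.g. loss = ce_loss + beta * mi_bound(student_feats, teacher_feats.detach()), with the teacher features detached so only the student (and the projection heads) are updated.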
Keywords
arbitrary network architectures, arbitrary student networks, CIFAR100, computational resources, contrastive objective, global feature representations, ImageNet, knowledge transfer, local feature representations, MIMKD, mutual information maximization knowledge distillation, ResNet-18 network, ResNet-34 teacher network, ResNet-50, ShuffleNetV2, student-teacher pairs