TY - GEN
T1 - Improved knowledge distillation via teacher assistant
AU - Mirzadeh, Seyed Iman
AU - Farajtabar, Mehrdad
AU - Li, Ang
AU - Levine, Nir
AU - Matsukawa, Akihiro
AU - Ghasemzadeh, Hassan
N1 - Funding Information:
Authors Mirzadeh and Ghasemzadeh were supported in part through grant CNS-1750679 from the United States National Science Foundation. The authors would like to thank Luke Metz, Rohan Anil, Sepehr Sameni, Hooman Shahrokhi, Janardhan Rao Doppa, and Hung Bui for their review and feedback.
Publisher Copyright:
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2020
Y1 - 2020
AB - Although deep neural networks are powerful models that achieve appealing results on many tasks, they are too large to be deployed on edge devices such as smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, in which a large pre-trained network (teacher) is used to train a smaller network (student). However, in this paper we show that the student network's performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher; in other words, a teacher can effectively transfer its knowledge to students only down to a certain size, not smaller. To alleviate this shortcoming, we introduce knowledge distillation via a teacher assistant, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation with a chain of teacher assistants. Theoretical analysis and extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets, with CNN and ResNet architectures, substantiate the effectiveness of the proposed approach.
UR - http://www.scopus.com/inward/record.url?scp=85106586399&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85106586399&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85106586399
T3 - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
SP - 5191
EP - 5198
BT - AAAI 2020 - 34th AAAI Conference on Artificial Intelligence
PB - AAAI Press
T2 - 34th AAAI Conference on Artificial Intelligence, AAAI 2020
Y2 - 7 February 2020 through 12 February 2020
ER -