In machine learning, knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized. It can be just as computationally expensive to evaluate a model even if it utilizes little of its knowledge capacity. Knowledge distillation transfers knowledge from a large model to a smaller model without loss of validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device).[1]
Knowledge distillation has been successfully used in several applications of machine learning such as object detection,[2] acoustic models,[3] and natural language processing.[4] Recently, it has also been introduced to graph neural networks applicable to non-grid data.[5]
Transferring knowledge from a large model to a small one requires somehow teaching the latter without loss of validity. If both models are trained on the same data, the small model may have insufficient capacity to learn a concise knowledge representation given the same computational resources and the same data as the large model. However, some information about a concise knowledge representation is encoded in the pseudolikelihoods assigned to the large model's output: when a model correctly predicts a class, it assigns a large value to the output variable corresponding to that class, and smaller values to the other output variables. The distribution of values among the outputs for a record provides information on how the large model represents knowledge. Therefore, the goal of economical deployment of a valid model can be achieved by training only the large model on the data, exploiting its better ability to learn concise knowledge representations, and then distilling this knowledge into the smaller model, which would not be able to learn it on its own, by training it to learn the soft output of the large model.
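As an informal illustration of why soft outputs carry more information than hard labels, the following sketch contrasts a one-hot label with a teacher's output distribution, softened by a temperature parameter (defined formally below). The class names, logit values, and temperature are invented for illustration only.

```python
import numpy as np

# Hypothetical teacher logits for one image of a dog (values are invented).
classes = ["cat", "dog", "car", "truck"]
teacher_logits = np.array([3.1, 5.2, -2.0, -1.5])

def softmax(z, t=1.0):
    """Softmax with temperature t; higher t gives a softer distribution."""
    z = np.asarray(z, dtype=float) / t
    z -= z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

hard_label = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot ground truth: "dog"
soft_t1 = softmax(teacher_logits, t=1.0)      # standard teacher output
soft_t4 = softmax(teacher_logits, t=4.0)      # softened output at t = 4

# The soft targets reveal that the teacher considers "cat" far more
# plausible than "car" or "truck" -- information absent from the hard label.
for name, p1, p4 in zip(classes, soft_t1, soft_t4):
    print(f"{name:5s}  t=1: {p1:.3f}   t=4: {p4:.3f}")
```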
A related methodology was model compression or pruning, where a trained network is reduced in size via methods such as Biased Weight Decay[6] and Optimal Brain Damage.[7]
The idea of using the output of one neural network to train another neural network was studied as the teacher-student network configuration.[8] In 1992, several papers studied the statistical mechanics of the teacher-student network configuration, where both networks are committee machines[9][10] or both are parity machines.[11]
Another early example of network distillation was also published in 1992, in the field of recurrent neural networks (RNNs). The problem was sequence prediction, and it was solved by two RNNs: one of them (the "automatizer") predicted the sequence, while the other (the "chunker") predicted the errors of the automatizer. Simultaneously, the automatizer was trained to predict the internal states of the chunker. Once the automatizer could predict the chunker's internal states well, it started fixing the errors itself, and the chunker eventually became obsolete, leaving just one RNN in the end.[12]
A related methodology for compressing the knowledge of multiple models into a single neural network was called model compression in 2006. Compression was achieved by training a smaller model on large amounts of pseudo-data labelled by a higher-performing ensemble, optimising the smaller model to match the logits of the ensemble.[13] Knowledge distillation is a generalisation of this approach, introduced by Geoffrey Hinton et al. in 2015, in a preprint that formulated the concept and showed some results achieved in the task of image classification.
Knowledge distillation is also related to the concept of behavioral cloning discussed by Faraz Torabi et al.[14]
Given a large model as a function of the vector variable $x$, trained for a specific classification task, the final layer of the network is typically a softmax of the form

$$y_i(x|t) = \frac{e^{\frac{z_i(x)}{t}}}{\sum_j e^{\frac{z_j(x)}{t}}}$$

where $t$ is a parameter called temperature, which for a standard softmax is normally set to 1. The softmax operator converts the logit values $z_i(x)$ into pseudo-probabilities, and higher values of the temperature generate softer distributions of pseudo-probabilities among the output classes.

Knowledge distillation consists of training a smaller network, called the distilled model, on a data set called the transfer set (different from the data set used to train the large model), using as loss function the cross-entropy between the output $y(x|t)$ of the distilled model and the output $\hat{y}(x|t)$ produced by the large model on the same record (or the average of the individual outputs, if the large model is an ensemble), with a high value of the softmax temperature $t$ for both models:

$$E(x|t) = -\sum_i \hat{y}_i(x|t) \log y_i(x|t).$$
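A minimal NumPy sketch of this loss follows; the function names, logit values, and the chosen temperature are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def softmax(z, t):
    """Temperature-scaled softmax y_i(x|t) = exp(z_i/t) / sum_j exp(z_j/t)."""
    z = np.asarray(z, dtype=float) / t
    z -= z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, t):
    """Cross-entropy E(x|t) = -sum_i yhat_i(x|t) * log y_i(x|t)."""
    y_student = softmax(student_logits, t)   # y(x|t), distilled model
    y_teacher = softmax(teacher_logits, t)   # yhat(x|t), large model
    return -np.sum(y_teacher * np.log(y_student))

# Example with invented logits and a high temperature t = 4.
student = [1.0, 2.0, 0.5]
teacher = [1.2, 3.0, 0.1]
print(distillation_loss(student, teacher, t=4.0))
```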
If ground truth is available for the transfer set, the process can be strengthened by adding to the loss the cross-entropy between the output of the distilled model (computed with $t = 1$) and the known label $\bar{y}$:

$$E(x|t) = -t^2 \sum_i \hat{y}_i(x|t) \log y_i(x|t) - \sum_i \bar{y}_i \log y_i(x|1)$$

where the component of the loss with respect to the large model is weighted by a factor of $t^2$, since, as the temperature grows, the gradient of the loss with respect to the model weights scales with a factor of $\frac{1}{t^2}$.
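A hedged sketch of this combined objective, again in NumPy with invented logits, label, and temperature:

```python
import numpy as np

def softmax(z, t=1.0):
    z = np.asarray(z, dtype=float) / t
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def combined_loss(student_logits, teacher_logits, hard_label, t):
    """E(x|t) = -t^2 sum_i yhat_i(x|t) log y_i(x|t) - sum_i ybar_i log y_i(x|1).

    The t**2 factor keeps the soft-target term on the same gradient scale as
    the hard-target term, since the soft term's gradient shrinks as 1/t**2.
    """
    y_soft = softmax(student_logits, t)      # distilled model at temperature t
    y_hard = softmax(student_logits, 1.0)    # distilled model at t = 1
    y_teacher = softmax(teacher_logits, t)   # large model at temperature t
    soft_term = -np.sum(y_teacher * np.log(y_soft))
    hard_term = -np.sum(np.asarray(hard_label, dtype=float) * np.log(y_hard))
    return t**2 * soft_term + hard_term

# Invented example: three classes, ground truth is class 1, temperature 4.
print(combined_loss([1.0, 2.0, 0.5], [1.2, 3.0, 0.1], [0, 1, 0], t=4.0))
```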
Under the assumption that the logits have zero mean, it is possible to show that model compression is a special case of knowledge distillation. The gradient of the knowledge distillation loss $E$ with respect to the logit $z_i$ of the distilled model is given by

\begin{align}
\frac{\partial E}{\partial z_i}
&= -\frac{\partial}{\partial z_i} \sum_j \hat{y}_j \log y_j \\
&= -\frac{\partial}{\partial z_i} \hat{y}_i \log y_i + \left( -\frac{\partial}{\partial z_i} \sum_{k \neq i} \hat{y}_k \log y_k \right) \\
&= -\hat{y}_i \frac{1}{y_i} \frac{\partial y_i}{\partial z_i} + \sum_{k \neq i} \left( -\hat{y}_k \cdot \frac{1}{y_k} \cdot e^{\frac{z_k}{t}} \cdot \left( -\frac{1}{\left( \sum_j e^{\frac{z_j}{t}} \right)^2} \right) \cdot e^{\frac{z_i}{t}} \cdot \frac{1}{t} \right) \\
&= -\hat{y}_i \frac{1}{y_i} \frac{\partial}{\partial z_i} \frac{e^{\frac{z_i}{t}}}{\sum_j e^{\frac{z_j}{t}}} + \sum_{k \neq i} \left( \hat{y}_k \cdot \frac{1}{y_k} \cdot y_k \cdot y_i \cdot \frac{1}{t} \right) \\
&= -\hat{y}_i \frac{1}{y_i} \left( \frac{\frac{1}{t} e^{\frac{z_i}{t}} \sum_j e^{\frac{z_j}{t}} - \frac{1}{t} e^{\frac{z_i}{t}} e^{\frac{z_i}{t}}}{\left( \sum_j e^{\frac{z_j}{t}} \right)^2} \right) + \frac{y_i \sum_{k \neq i} \hat{y}_k}{t} \\
&= -\hat{y}_i \frac{1}{y_i} \left( \frac{y_i}{t} - \frac{y_i^2}{t} \right) + \frac{y_i \left( 1 - \hat{y}_i \right)}{t} \\
&= \frac{1}{t} \left( y_i - \hat{y}_i \right) \\
&= \frac{1}{t} \left( \frac{e^{\frac{z_i}{t}}}{\sum_j e^{\frac{z_j}{t}}} - \frac{e^{\frac{\hat{z}_i}{t}}}{\sum_j e^{\frac{\hat{z}_j}{t}}} \right)
\end{align}

where $\hat{z}_i$ are the logits of the large model. For large values of $t$ this can be approximated as

$$\frac{1}{t} \left( \frac{1 + \frac{z_i}{t}}{N + \sum_j \frac{z_j}{t}} - \frac{1 + \frac{\hat{z}_i}{t}}{N + \sum_j \frac{\hat{z}_j}{t}} \right)$$

and, under the zero-mean hypothesis $\sum_j z_j = \sum_j \hat{z}_j = 0$, it becomes $\frac{z_i - \hat{z}_i}{N t^2}$, which is, up to the constant factor $\frac{1}{N t^2}$, the derivative of $\frac{1}{2} \left( z_i - \hat{z}_i \right)^2$, i.e. the loss is equivalent to matching the logits of the two models, as done in model compression.
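The closed-form gradient above can be checked numerically. The sketch below (NumPy, with invented zero-mean logits) compares a finite-difference gradient of $E$ with the expression $\frac{1}{t}(y_i - \hat{y}_i)$ and with the high-temperature, zero-mean approximation $\frac{z_i - \hat{z}_i}{N t^2}$; the logit values and temperature are assumptions for the sake of the check.

```python
import numpy as np

def softmax(z, t):
    z = np.asarray(z, dtype=float) / t
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def kd_loss(z_student, z_teacher, t):
    """E(x|t) = -sum_i yhat_i(x|t) log y_i(x|t)."""
    return -np.sum(softmax(z_teacher, t) * np.log(softmax(z_student, t)))

rng = np.random.default_rng(0)
N = 5
t = 20.0
# Zero-mean logits, as assumed in the derivation (values are invented).
z = rng.normal(size=N); z -= z.mean()       # distilled model logits z_i
zh = rng.normal(size=N); zh -= zh.mean()    # large model logits zhat_i

# Finite-difference gradient of E with respect to each z_i.
eps = 1e-5
numeric = np.array([
    (kd_loss(z + eps * np.eye(N)[i], zh, t)
     - kd_loss(z - eps * np.eye(N)[i], zh, t)) / (2 * eps)
    for i in range(N)
])

closed_form = (softmax(z, t) - softmax(zh, t)) / t   # (1/t)(y_i - yhat_i)
approximation = (z - zh) / (N * t**2)                # (z_i - zhat_i)/(N t^2)

print(np.allclose(numeric, closed_form, atol=1e-7))  # gradient formula holds
print(np.abs(closed_form - approximation).max())     # small for large t
```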