Interpreting "Distilling the Knowledge in a Neural Network"
2022-07-28 06:12:00 【An instant of loss】
Problem: because the network structure is complex, prediction is too expensive, and it is hard to deploy the network to users on lightweight devices.
Solution: use knowledge distillation to compress the model into a lightweight network.
The following uses this paper as the basis for understanding knowledge distillation.
1、Soft labels and hard labels
Description: a hard label assigns 1 to the correct class and 0 to every wrong class. Soft labels hold that the wrong classes should not all be zero, because even among wrong labels there are gaps in plausibility, as shown below.
Class           Hard label   Soft label
BMW             1            0.9
Mercedes        0            0.6
Garbage truck   0            0.3
Carrot          0            0.001
As the table shows, when the prediction is "BMW", the hard label assigns 1 to the correct class and 0 to everything else: each entry is either 0 or 1. In reality, though, among the wrong classes a BMW looks far more like a Mercedes than like a carrot, so the wrong labels themselves carry information. Soft labels are therefore introduced, turning each label value into a number between 0 and 1 so that the targets convey richer information.
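To make the contrast concrete, here is a minimal Python sketch. The numbers are illustrative assumptions and, unlike the table above, are normalized so the soft label forms a probability distribution:

```python
import numpy as np

# Hypothetical 4-class example in the order [BMW, Mercedes, garbage truck, carrot]
hard_label = np.array([1.0, 0.0, 0.0, 0.0])      # one-hot: all-or-nothing
soft_label = np.array([0.90, 0.06, 0.03, 0.01])  # wrong classes keep a graded gap

def cross_entropy(target, pred, eps=1e-12):
    """Cross entropy: -sum_j target_j * log(pred_j)."""
    return -np.sum(target * np.log(pred + eps))

# Two students with the same BMW probability but different wrong-class rankings
pred_good = np.array([0.70, 0.20, 0.08, 0.02])   # Mercedes > carrot, as it should be
pred_bad  = np.array([0.70, 0.02, 0.08, 0.20])   # carrot > Mercedes

print(cross_entropy(hard_label, pred_good) == cross_entropy(hard_label, pred_bad))  # True
print(cross_entropy(soft_label, pred_good) <  cross_entropy(soft_label, pred_bad))  # True
```

Under the hard label the two students are indistinguishable; under the soft label, the one that ranks Mercedes above carrot is rewarded.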
2、Temperature coefficient T
Soft labels can expose the information in the wrong classes, but the gaps between some wrong classes are still not obvious; that is, the label values are not soft enough. To make these inconspicuous gaps more visible, the author introduces a temperature coefficient $T$ into the original softmax function:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Here $q_i$ is the class probability output by the softmax, $z_i$ is the logit of class $i$, and $T$ is the temperature coefficient.
PS: if $T$ is small, the information gap among the wrong classes stays small; but if $T$ is too large, the labels become too soft, the distribution tends toward uniformity ("egalitarianism"), and the prediction signal is lost. Concretely, the larger $T$ is, the smoother (flatter) the softened distribution becomes; an overly large $T$ makes the output nearly uniform and uninformative, as the sketch below shows.
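A minimal sketch of the temperature-scaled softmax; the logit values are made up for illustration:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax: q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())               # subtract max for numerical stability
    return e / e.sum()

logits = np.array([8.0, 5.0, 3.0, -6.0])  # hypothetical logits for the four classes
for T in (1, 3, 10, 100):
    print(f"T={T:>3}:", np.round(softmax_T(logits, T), 4))
# T=1   -> nearly one-hot: the gaps between wrong classes are invisible
# T=100 -> nearly uniform: labels too soft ("egalitarianism")
```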
3、Knowledge distillation network framework
In the framework, each sample is fed to both the teacher model and the student model for training. The teacher model is the original complex model; the student model is the simple, compressed model. Two losses are computed separately: the distillation loss is the cross entropy between the teacher's and the student's outputs, both softened at temperature $T$, while the student loss is the cross entropy between the true labels and the student's output at temperature 1. The total loss is:

$$L = \lambda L_{\text{distill}} + (1 - \lambda) L_{\text{student}}, \qquad L_{\text{distill}} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} Y_{ij} \log P(x_{ij})$$

where there are $m$ samples and $n$ classes, $y$ denotes the true (hard) labels, $P(x_{ij})$ is the student network's soft output, $Y_{ij}$ is the teacher network's soft label, and $\lambda$ is a weighting coefficient between 0 and 1; $L_{\text{student}}$ is the same cross entropy computed between $y$ and the student's temperature-1 output.
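A per-sample sketch of this combined loss, following the formula above; the values of T and λ (and the logits in the example) are assumptions for illustration:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())                 # subtract max for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=3.0, lam=0.7, eps=1e-12):
    """L = lam * L_distill (at T) + (1 - lam) * L_student (at T = 1)."""
    Y = softmax_T(teacher_logits, T)        # teacher soft labels Y_ij
    P = softmax_T(student_logits, T)        # student soft output P(x_ij)
    l_distill = -np.sum(Y * np.log(P + eps))
    P1 = softmax_T(student_logits, 1.0)     # student output at temperature 1
    l_student = -np.sum(hard_label * np.log(P1 + eps))
    return lam * l_distill + (1.0 - lam) * l_student

# Example with made-up logits for the four classes above
teacher = np.array([8.0, 5.0, 3.0, -6.0])
student = np.array([6.0, 2.0, 1.0, -3.0])
print(distillation_loss(student, teacher, np.array([1.0, 0.0, 0.0, 0.0])))
```

The paper also suggests multiplying the soft-target term by $T^2$ so that its gradient magnitude stays comparable to the hard-target term as $T$ changes; that factor is omitted here for clarity.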
4、Cross-entropy gradient
Following the paper's notation, let the teacher's logits be $v_i$ and the student's logits be $z_i$, producing softened probability distributions $p_i$ and $q_i$ respectively at temperature $T$. The gradient of the cross entropy $C$ with respect to the student's logits is:

$$\frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)$$
1、If the temperature $T$ is large compared with the magnitude of the logits, the formula becomes:

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right)$$

where $N$ is the number of classes.
PS: this uses the Taylor expansion $e^x = 1 + x + \frac{x^2}{2!} + \cdots$. When $T$ is large, $x = z_i/T$ is small and the higher-order terms can be ignored; keeping only the first two terms gives the form above.
2、If, in addition, the logits are zero-meaned for each sample, i.e. $\sum_j z_j = 0 = \sum_j v_j$, this simplifies to:

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{NT^2}(z_i - v_i)$$

PS: the two sums in the denominators vanish, so each denominator reduces to $N$.
Therefore, in the special case where both conditions hold, distillation reduces to minimizing the mean squared error $\frac{1}{2}(z_i - v_i)^2$ between the student's and the teacher's logits, as the sketch below checks numerically.
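A small numeric check of the two special cases; the random logits, class count N, and temperature T are arbitrary choices for illustration:

```python
import numpy as np

# Numeric check of dC/dz_i = (q_i - p_i)/T against (z_i - v_i)/(N*T^2),
# where C = -sum_i p_i * log(q_i) is the soft-target cross entropy.
rng = np.random.default_rng(0)
N, T = 10, 100.0                       # number of classes; a high temperature
v = rng.normal(size=N); v -= v.mean()  # teacher logits, zero-meaned per sample
z = rng.normal(size=N); z -= z.mean()  # student logits, zero-meaned per sample

def softmax_T(x, T):
    e = np.exp((x - np.max(x)) / T)
    return e / e.sum()

p = softmax_T(v, T)                    # teacher soft targets p_i
q = softmax_T(z, T)                    # student soft output q_i
exact  = (q - p) / T
approx = (z - v) / (N * T**2)
rel = np.abs(exact - approx).max() / np.abs(approx).max()
print(rel)  # small (on the order of z_i/T): the two gradients agree at high T
```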
5、Conclusion
To verify the effect of knowledge distillation, the paper's authors first ran a preliminary experiment on the MNIST handwritten digit set, and then validated distillation's performance by distilling a complex speech recognition model.
