Paper study -- Masked Generative Distillation (MGD)
2022-07-28 14:13:00 【jiangchao98】
Paper: https://arxiv.org/abs/2205.01529
GitHub: https://github.com/yzd-v/MGD
I spotted this knowledge distillation paper in the CV field. After reading the abstract and the figures I was genuinely delighted, hahaha, so knowledge distillation can still be played this way.

Knowledge distillation can be broadly divided into logit distillation and feature distillation. Feature distillation is more extensible and has already been applied to many vision tasks. However, because different tasks use different model architectures, many feature distillation methods are designed for one specific task.
Previous knowledge distillation methods focus on making the student imitate the features of a stronger teacher, so that the student's features gain stronger representational power. The authors argue that improving the student's representational power does not have to come from directly imitating the teacher. From this viewpoint they turn the imitation task into a generation task: the student has to generate the teacher's strong features from its own weaker features. During distillation, the student's features are randomly masked, forcing the student to reconstruct the teacher's complete features from only part of its own, which improves the student's representational power.
To show that MGD does not improve the student by imitation, the authors visualize the feature (attention) maps of the student and the teacher. Before distillation, the attention of student and teacher differs a lot. After distillation with FGD (which imitates the teacher), the student's attention becomes very close to the teacher's, and performance improves greatly. After distillation with MGD, however, the student still differs considerably from the teacher: its response to the background is greatly reduced while its response to the target is strengthened, and the student's final performance is better than with FGD.

The usual approach makes the student's features imitate the teacher's, using KL divergence or MSE to align the two; but a student trained purely by imitation is bound to remain weaker than the teacher model.
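For reference, here is a minimal PyTorch sketch of that conventional MSE feature imitation; the channel sizes and the 1x1 alignment convolution are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Conventional feature imitation: project the student feature to the teacher's
# channel dimension, then pull it directly toward the teacher feature with MSE.
align = nn.Conv2d(128, 256, kernel_size=1)   # hypothetical channel sizes
student_feat = torch.randn(2, 128, 32, 32)   # toy student feature map
teacher_feat = torch.randn(2, 256, 32, 32)   # toy teacher feature map
imitation_loss = F.mse_loss(align(student_feat), teacher_feat)
```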

Generation with Masked Feature
In this paper's method, the student's feature map is first randomly masked; the masked feature is then transformed by a small generation block built from convolutions and a ReLU activation, which tries to repair (reconstruct) the teacher's full feature.
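Below is a minimal PyTorch sketch of this masking-and-generation step, written from my reading of the paper rather than copied from the official repo; the 1x1 alignment convolution, the 3x3 kernels in the generation block, and the default mask ratio here are assumptions.

```python
import torch
import torch.nn as nn


class MaskedGeneration(nn.Module):
    """Sketch of MGD's masking + generation step (not the official implementation)."""

    def __init__(self, student_channels, teacher_channels, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Align the student's channel dimension to the teacher's (assumed 1x1 conv).
        self.align = (nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
                      if student_channels != teacher_channels else nn.Identity())
        # Generation block: conv -> ReLU -> conv, which tries to reconstruct
        # the teacher's feature from the masked student feature.
        self.generation = nn.Sequential(
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=3, padding=1),
        )

    def forward(self, student_feat):
        x = self.align(student_feat)
        n, _, h, w = x.shape
        # Random spatial mask: each pixel is zeroed with probability mask_ratio,
        # so the student must recover the teacher's feature from partial information.
        mask = (torch.rand(n, 1, h, w, device=x.device) > self.mask_ratio).float()
        return self.generation(x * mask)
```

The output of this block is then compared with the teacher's feature map using the squared-error loss defined below.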

The distillation loss is:

$$L_{dis}(S,T)=\sum_{l=1}^{L}\sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}\Big(T^{l}_{k,i,j}-\mathcal{G}\big(f_{align}(S^{l})\cdot M^{l}\big)_{k,i,j}\Big)^{2}$$

where S and T denote the student and the teacher respectively, L is the number of distilled layers, and C, H, W are the dimensions of the feature map; M^l is the random binary mask applied to the student feature, f_align aligns the student's channels to the teacher's, and G is the generation block (two convolutions with a ReLU in between).

The overall training loss is:

$$L_{all}=L_{original}+\alpha\cdot L_{dis}$$

that is, the original task loss and the distillation loss are simply added, with the weight α balancing the two terms.
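A small sketch of how the two terms could be combined in a training step; the toy tensor shapes, the placeholder task loss, and the value of α here are illustrative assumptions (the paper tunes α per task).

```python
import torch


def mgd_dis_loss(generated_feats, teacher_feats):
    # Sum of squared differences over the distilled layers and every
    # (channel, height, width) position, mirroring the L_dis formula above.
    return sum(((t - g) ** 2).sum() for g, t in zip(generated_feats, teacher_feats))


# Toy usage with random tensors standing in for one distilled feature level.
teacher_feats = [torch.randn(2, 256, 32, 32)]
generated_feats = [torch.randn(2, 256, 32, 32)]  # output of the generation block above
loss_task = torch.tensor(1.0)                    # placeholder for the original task loss
alpha = 2e-5                                     # illustrative weight, not the paper's exact setting
loss_all = loss_task + alpha * mgd_dis_loss(generated_feats, teacher_feats)
```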

MGD is a feature-based distillation method, so it can be applied to different tasks. The paper evaluates it on several of them, including image classification, object detection, semantic segmentation, and instance segmentation.
MGD forces the student to generate the teacher's complete feature map instead of imitating it directly, which helps the student model build a better representation of the input image. By contrast, imitative distillation uses a squared-difference loss between the teacher's and the student's feature maps.
As shown in the loss curves in the figure, the student distilled with MGD gradually surpasses the teacher model.