Paper study -- Masked Generative Distillation (MGD)
2022-07-28 14:13:00 【jiangchao98】
Paper: https://arxiv.org/abs/2205.01529
GitHub: https://github.com/yzd-v/MGD
I spotted this knowledge distillation paper in the CV field. After reading the abstract and the figures I was genuinely delighted, hahaha, so knowledge distillation can still be played this way.

Knowledge distillation can be broadly divided into logit distillation and feature distillation. Feature distillation is more extensible and has already been applied to many vision tasks. However, because different tasks use different model architectures, many feature distillation methods are designed for one specific task.
Previous knowledge distillation methods focus on making the student imitate the features of a stronger teacher, so that the student's features gain stronger representational power. The authors argue that improving the student's representational power does not have to come from directly imitating the teacher. From this viewpoint they turn the imitation task into a generation task: the student has to generate the teacher's strong features from its own weaker features. During distillation, the student's features are randomly masked, forcing the student to reconstruct the teacher's complete features from only part of its own, which improves the student's representational power.
To show that MGD does not improve the student by imitation, the authors visualize the feature (attention) maps of the student and the teacher. Before distillation, the attention of student and teacher differs a lot. After distillation with FGD (which imitates the teacher), the student's attention becomes very close to the teacher's, and performance improves greatly. After distillation with MGD, however, the student still differs considerably from the teacher: its response to the background is greatly reduced while its response to the target is strengthened, and the student's final performance is better than with FGD.

The usual approach makes the student's features imitate the teacher's, using KL divergence or MSE to align the two; but a student trained purely by imitation is bound to remain weaker than the teacher model.
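For reference, here is a minimal PyTorch sketch of that conventional MSE feature imitation; the channel sizes and the 1x1 alignment convolution are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Conventional feature imitation: project the student feature to the teacher's
# channel dimension, then pull it directly toward the teacher feature with MSE.
align = nn.Conv2d(128, 256, kernel_size=1)   # hypothetical channel sizes
student_feat = torch.randn(2, 128, 32, 32)   # toy student feature map
teacher_feat = torch.randn(2, 256, 32, 32)   # toy teacher feature map
imitation_loss = F.mse_loss(align(student_feat), teacher_feat)
```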

Generation with Masked Feature
In this paper's method, the student's feature map is first randomly masked; the masked feature is then transformed by a small generation block built from convolutions and a ReLU activation, which tries to repair (reconstruct) the teacher's full feature.
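Below is a minimal PyTorch sketch of this masking-and-generation step, written from my reading of the paper rather than copied from the official repo; the 1x1 alignment convolution, the 3x3 kernels in the generation block, and the default mask ratio here are assumptions.

```python
import torch
import torch.nn as nn


class MaskedGeneration(nn.Module):
    """Sketch of MGD's masking + generation step (not the official implementation)."""

    def __init__(self, student_channels, teacher_channels, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Align the student's channel dimension to the teacher's (assumed 1x1 conv).
        self.align = (nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
                      if student_channels != teacher_channels else nn.Identity())
        # Generation block: conv -> ReLU -> conv, which tries to reconstruct
        # the teacher's feature from the masked student feature.
        self.generation = nn.Sequential(
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=3, padding=1),
        )

    def forward(self, student_feat):
        x = self.align(student_feat)
        n, _, h, w = x.shape
        # Random spatial mask: each pixel is zeroed with probability mask_ratio,
        # so the student must recover the teacher's feature from partial information.
        mask = (torch.rand(n, 1, h, w, device=x.device) > self.mask_ratio).float()
        return self.generation(x * mask)
```

The output of this block is then compared with the teacher's feature map using the squared-error loss defined below.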

The distillation loss is:

$$L_{dis}(S,T)=\sum_{l=1}^{L}\sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}\Big(T^{l}_{k,i,j}-\mathcal{G}\big(f_{align}(S^{l})\cdot M^{l}\big)_{k,i,j}\Big)^{2}$$

where S and T denote the student and the teacher respectively, L is the number of distilled layers, and C, H, W are the dimensions of the feature map; M^l is the random binary mask applied to the student feature, f_align aligns the student's channels to the teacher's, and G is the generation block (two convolutions with a ReLU in between).

The overall training loss is:

$$L_{all}=L_{original}+\alpha\cdot L_{dis}$$

that is, the original task loss and the distillation loss are simply added, with the weight α balancing the two terms.
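A small sketch of how the two terms could be combined in a training step; the toy tensor shapes, the placeholder task loss, and the value of α here are illustrative assumptions (the paper tunes α per task).

```python
import torch


def mgd_dis_loss(generated_feats, teacher_feats):
    # Sum of squared differences over the distilled layers and every
    # (channel, height, width) position, mirroring the L_dis formula above.
    return sum(((t - g) ** 2).sum() for g, t in zip(generated_feats, teacher_feats))


# Toy usage with random tensors standing in for one distilled feature level.
teacher_feats = [torch.randn(2, 256, 32, 32)]
generated_feats = [torch.randn(2, 256, 32, 32)]  # output of the generation block above
loss_task = torch.tensor(1.0)                    # placeholder for the original task loss
alpha = 2e-5                                     # illustrative weight, not the paper's exact setting
loss_all = loss_task + alpha * mgd_dis_loss(generated_feats, teacher_feats)
```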

MGD is a feature-based distillation method, so it can be applied to different tasks. The paper evaluates it on several of them, including image classification, object detection, semantic segmentation, and instance segmentation.
MGD forces the student to generate the teacher's complete feature map instead of imitating it directly, which helps the student model build a better representation of the input image. By contrast, imitative distillation uses a squared-difference loss between the teacher's and the student's feature maps.
As shown in the loss curves in the figure, the student distilled with MGD gradually surpasses the teacher model.