Deep Learning (Self-Supervised: MoCo v3): An Empirical Study of Training Self-Supervised Vision Transformers
2022-07-28 06:09:00 【Food to doubt life】
Preface
MoCo v3 is a new work from Kaiming He's team, published at ICCV 2021. It is a self-supervised learning paper that makes a few minor changes on top of MoCo v2. It also reports an "instability" phenomenon that arises when training ViT with self-supervision, and gives a trick to mitigate it.
As always, the team's paper is thorough. This post mainly summarizes MoCo v3 and the instability phenomenon in self-supervised ViT training; MoCo v3's performance benchmarks are not covered in much depth.
This article is a personal summary; if there are errors, please point them out.
It assumes the reader already knows MoCo v2 and the common contrastive learning methods, which are not covered in detail.
MoCo V3
Compared with MoCo V2
- MoCo v3 abandons the memory bank; the negative examples for each sample are the other samples in the same batch, similar to SimCLR. The authors found that when the batch size is large enough, choosing negatives from the batch performs about the same as choosing them from a memory bank.
- The loss function is InfoNCE, whose mathematical expression is:
$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k^+ / \tau)}{\exp(q \cdot k^+ / \tau) + \sum_{k^-} \exp(q \cdot k^- / \tau)}$$
where $q$ and $k^+$ form a positive pair and $q$ and each $k^-$ form a negative pair. For how they are produced, see Deep Learning (Self-Supervised: MoCo)——Momentum Contrast for Unsupervised Visual Representation Learning.
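To make the loss concrete, here is a minimal NumPy sketch of InfoNCE with in-batch negatives: each row of `k` matched to the same row of `q` is the positive, and all other rows serve as negatives. The function name, shapes, and temperature value are illustrative, not the paper's exact implementation.

```python
import numpy as np

def info_nce(q, k, tau=0.2):
    """InfoNCE over a batch: for each query q[i], k[i] is the positive
    and the other rows k[j] (j != i) act as in-batch negatives."""
    # L2-normalize so the dot product is a cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                        # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    # cross-entropy with the diagonal (the positive pair) as the target
    return -log_prob[idx, idx].mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
# identical q/k pairs give a low loss, since each positive dominates
print(info_nce(x, x))
```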
The pseudocode of the algorithm is shown below.
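The paper's pseudocode roughly corresponds to the following PyTorch-style sketch of one training step: a query encoder, a momentum encoder updated by EMA, two augmented views, and a symmetrized contrastive loss. The tiny `Linear` encoders here stand in for the real backbone plus projection/prediction heads, so the dimensions are illustrative only.

```python
import copy
import torch
import torch.nn.functional as F

def ctr(q, k, tau=0.2):
    # InfoNCE with in-batch negatives: the diagonal is the positive pair
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)

f_q = torch.nn.Linear(32, 16)   # query encoder (plus predictor in the paper)
f_k = copy.deepcopy(f_q)        # momentum encoder: no gradient updates
for p in f_k.parameters():
    p.requires_grad_(False)
opt = torch.optim.SGD(f_q.parameters(), lr=0.1)
m = 0.99                        # EMA momentum coefficient

x1, x2 = torch.randn(8, 32), torch.randn(8, 32)   # two augmented views
q1, q2 = f_q(x1), f_q(x2)
with torch.no_grad():
    k1, k2 = f_k(x1), f_k(x2)
loss = ctr(q1, k2) + ctr(q2, k1)                  # symmetrized loss
opt.zero_grad()
loss.backward()
opt.step()
with torch.no_grad():           # EMA update of the momentum encoder
    for pk, pq in zip(f_k.parameters(), f_q.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)
```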
The comparison with MoCo v2 and MoCo v2+ (i.e., SimSiam) on ImageNet linear classification is shown in the figure below; the backbone is ResNet-50.
The "instability" phenomenon in self-supervised ViT training
When a CNN model is trained with self-supervision, the accuracy of a KNN classifier rises smoothly; with ViT it does not, as shown in the figure below.
The figure above compares the KNN accuracy curves of MoCo v3 with a ViT backbone under different batch sizes. The larger the batch size, the more obvious the training instability, i.e., the more the KNN accuracy oscillates, and this oscillation hurts the final model: at batch size 6144 the oscillation is severe, and linear classification accuracy is only 69.7.
Can we adjust the learning rate to alleviate the training instability?
As the figure above shows, a smaller learning rate does reduce training instability, but it comes at a cost: with a learning rate of 0.5×$10^{-4}$, the training curve is smoother, yet the final model performance drops.
Is the instability specific to MoCo v3's contrastive learning on ViT? The authors found that BYOL and SimCLR show a similar phenomenon, as the blue lines below indicate, suggesting that ViT's training instability may be fairly universal.
How to alleviate the instability of self-supervised ViT training
The authors noticed that large gradient spikes lead to "dips" in KNN accuracy. After comparing the gradients of all layers, they found that the gradient spike appears earliest in the first layer, and only after a few more training steps does the gradient of the last layer also spike, as shown in the figure below.
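The per-layer inspection described above can be reproduced by logging gradient norms after each backward pass. A minimal sketch follows, with a small `Sequential` stack standing in for a deep ViT; layer sizes are made up for illustration.

```python
import torch

model = torch.nn.Sequential(            # stand-in for a deep network
    torch.nn.Linear(16, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
out = model(torch.randn(4, 16)).sum()
out.backward()

# per-layer gradient norms: a sudden spike in the first (patch-projection)
# layer is the symptom the authors observed
grad_norms = {name: p.grad.norm().item()
              for name, p in model.named_parameters() if p.grad is not None}
for name, g in grad_norms.items():
    print(f"{name}: {g:.4f}")
```

In a real run one would record these norms every iteration (e.g. to TensorBoard) and look for spikes that precede dips in the KNN-accuracy curve.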
Based on this, the authors hypothesized that training instability appears earlier in the shallow layers. If so, would fixing the parameters of the shallow layers, so that they cannot become unstable, improve training? They therefore froze ViT's patch projection layer; the model's performance is then shown by the green line in the figure below.
As the figure shows, even with a larger learning rate such as 1.5×$10^{-4}$, training remains stable, and the linear classification performance of the model is now roughly on par with a ResNet-50 backbone. This shows that alleviating training instability helps model performance (without the fix, at a learning rate of 1.5×$10^{-4}$, the model's linear classification accuracy is 71.7). The trick works not only for MoCo v3 but also for SimCLR and BYOL.
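In code, the fix amounts to freezing the patch-projection layer at its random initialization and excluding it from the optimizer. A minimal sketch, assuming the common implementation of patch projection as a `Conv2d` whose kernel and stride equal the patch size (the embedding dimension and image size here are illustrative):

```python
import torch

# stand-in ViT stem + head: patch projection is a Conv2d with
# kernel/stride = patch size (16), mapping 3 channels to the embed dim
vit = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=16, stride=16),  # patch projection
    torch.nn.Flatten(),
    torch.nn.Linear(64 * 4 * 4, 10),                    # rest of the model
)

# freeze the (randomly initialized) patch projection, as in the paper's fix
for p in vit[0].parameters():
    p.requires_grad_(False)

# pass only the still-trainable parameters to the optimizer
opt = torch.optim.SGD((p for p in vit.parameters() if p.requires_grad), lr=0.1)

x = torch.randn(2, 3, 64, 64)   # 64x64 images -> a 4x4 grid of patches
loss = vit(x).sum()
loss.backward()
# the frozen layer receives no gradient, so it can no longer spike
```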
Reflections
The authors mention that freezing ViT's patch projection layer effectively shrinks the solution space. Would a ViT with fewer parameters then still show training instability? Freezing the patch projection layer is also akin to reducing the network's depth; is training instability related to network depth (similar to the early gradient-explosion and gradient-vanishing problems in CNNs)?