Deep Learning (Self-Supervision: SimSiam) -- Exploring Simple Siamese Representation Learning
2022-07-28 06:09:00 【Food to doubt life】
Preface
This paper, from Kaiming He's group, was published at CVPR 2021 and was nominated for the best paper award. It tackles the collapse problem in self-supervised contrastive learning: collapse means that no matter what the input is, the feature extractor outputs the same feature vector.
This post gives a brief introduction to SimSiam and records its most interesting experimental results.
The authors do not definitively explain why SimSiam avoids collapse, but the paper is brilliant nonetheless.
SimSiam overview

The figure above shows the overall structure of SimSiam. Concretely:
- Apply data augmentation to the input image $x$ to obtain two views $x_1$ and $x_2$.
- Feed $x_1$ and $x_2$ into the same feature extractor, followed by a projection MLP, to obtain $z_1$ and $z_2$.
- Pass $z_1$ through a prediction MLP to obtain $p_1$ (and, symmetrically, $z_2$ to obtain $p_2$).
The contrastive loss is the negative cosine similarity

$$\mathcal{D}(p_1, z_2) = -\frac{p_1}{\|p_1\|_2} \cdot \frac{z_2}{\|z_2\|_2}$$

symmetrized over the two views:

$$\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \mathrm{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \mathrm{stopgrad}(z_1))$$
During backpropagation, $\frac{z_2}{\|z_2\|_2}$ is treated as a constant (the stop-gradient), and only $\frac{p_1}{\|p_1\|_2}$ receives gradients. Note that the collapsed solution still lies in the solution space.
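The loss and its stop-gradient can be sketched in a few lines. The sketch below uses NumPy purely for illustration (NumPy has no autograd, so the stop-gradient is only indicated in a comment; in a real framework it would be `z.detach()`, and all function names here are mine):

```python
import numpy as np

def neg_cosine(p, z):
    """D(p, z): negative cosine similarity, averaged over the batch.

    z is the stop-gradient branch: it is treated as a constant during
    backpropagation. NumPy has no autograd, so this is only indicated
    here; in PyTorch it would be z = z.detach().
    """
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -(p * z).sum(axis=1).mean()

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized loss: L = D(p1, z2)/2 + D(p2, z1)/2."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
# If predictions equal projections exactly, each D term is -1, so L = -1
# (the minimum of the loss).
print(simsiam_loss(z, z, z, z))
```

Since each term is a cosine similarity, the loss is bounded in $[-1, 1]$, and collapse (identical outputs for every input) attains the minimum, which is why some extra mechanism is needed to avoid it.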
To explain the optimization above, the author hypothesizes a loss function of the form

$$\mathcal{L}(\theta, \eta) = \mathbb{E}_{x,\mathcal{T}}\Big[\big\|\mathcal{F}_\theta(\mathcal{T}(x)) - \eta_x\big\|_2^2\Big]$$

where $\mathcal{F}_\theta(x)$ is the neural network, $\mathcal{T}(x)$ denotes a data augmentation applied to $x$, and $\eta_x$ is an extra parameter to be estimated for each image. The parameters to estimate are therefore $\theta$ and $\eta$, and minimizing the loss proceeds much like coordinate descent:

$$\theta^t = \arg\min_\theta \mathcal{L}(\theta, \eta^{t-1}), \qquad \eta^t = \arg\min_\eta \mathcal{L}(\theta^t, \eta)$$
Here $\eta^{t-1}$ denotes the value of $\eta$ after $t-1$ optimization steps, and likewise for $\theta^t$. First $\eta^{t-1}$ is held constant and $\theta^t$ is solved for, so that $\mathcal{L}(\theta^t, \eta^{t-1})$ is minimal over all values of $\theta$; then $\eta^t$ is obtained the same way. This is exactly the coordinate descent method. A closed form for $\eta^t$ follows from setting the derivative to zero:
$$\frac{\partial \mathcal{L}(\theta,\eta)}{\partial \eta} = -\mathbb{E}_{\mathcal{T}}\big[2\big(\mathcal{F}_{\theta^t}(\mathcal{T}(x)) - \eta_x\big)\big] = 0$$
Solving for $\eta_x$ gives

$$\eta_x^t = \mathbb{E}_{\mathcal{T}}\big[\mathcal{F}_{\theta^t}(\mathcal{T}(x))\big]$$
Via a Monte Carlo approximation, this expectation can be estimated with a single augmented sample:

$$\eta_x^t \approx \mathcal{F}_{\theta^t}(\mathcal{T}'(x))$$
$\mathcal{T}'(x)$ denotes another augmentation applied to $x$, drawn from the same distribution as $\mathcal{T}(x)$; writing it this way keeps the later expressions clean. Substituting this estimate back into the $\theta$-subproblem gives

$$\theta^{t+1} = \arg\min_\theta \mathbb{E}_{x,\mathcal{T}}\Big[\big\|\mathcal{F}_\theta(\mathcal{T}(x)) - \mathcal{F}_{\theta^t}(\mathcal{T}'(x))\big\|_2^2\Big]$$
This objective can be read as follows: apply two data augmentations to an image $x$ to obtain $\mathcal{T}(x)$ and $\mathcal{T}'(x)$, run both through the network, and minimize their L2 distance in feature space; during backpropagation, $\mathcal{F}_{\theta^t}(\mathcal{T}'(x))$ is a constant. Once $\mathcal{F}_{\theta^t}(\mathcal{T}'(x))$ and $\mathcal{F}_{\theta}(\mathcal{T}(x))$ are L2-normalized, this objective is equivalent to the SimSiam loss.
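To make the alternation concrete, here is a toy numeric sketch; everything in it is invented for illustration (a scalar "network" $F_\theta(x) = \theta x$, additive-noise augmentation, a made-up step size). The $\eta$-step averages the network output over augmentations, and the $\theta$-step takes a gradient step with $\eta$ held constant; together they drive the hypothesized loss down:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=10)          # the "dataset": 10 scalar images
theta = 1.5                       # network parameter: F_theta(x) = theta * x
eta = np.zeros_like(xs)           # one latent eta_x per image

def aug(x, n):
    """n random augmentations of x (additive noise stands in for T)."""
    return x + 0.1 * rng.normal(size=n)

def loss(theta, eta):
    """Monte Carlo estimate of E_{x,T}[(F_theta(T(x)) - eta_x)^2]."""
    t = np.stack([aug(x, 32) for x in xs])           # (10, 32) views
    return ((theta * t - eta[:, None]) ** 2).mean()

history = [loss(theta, eta)]
for _ in range(20):
    # eta-step: eta_x = E_T[F_theta(T(x))], estimated with 32 samples
    eta = np.array([(theta * aug(x, 32)).mean() for x in xs])
    # theta-step: one gradient step on the loss with eta held constant
    t = np.stack([aug(x, 32) for x in xs])
    grad = (2 * (theta * t - eta[:, None]) * t).mean()
    theta -= 0.2 * grad
    history.append(loss(theta, eta))

print(history[0], history[-1])    # the alternation reduces the loss
```

In this toy the $\eta$-step does most of the work; note also that nothing here forbids the collapsed solution $\theta = 0$, $\eta = 0$, matching the earlier remark that collapse remains in the solution space.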
SimSiam can therefore be viewed as an optimization problem over two sets of variables to be estimated, solved alternately. To test this hypothesis, the author ran a set of experiments, shown below.
In the k-step variant, $\mathcal{F}_{\theta^t}(\mathcal{T}'(x))$ is cached and treated as a constant while $\mathcal{F}_\theta(\mathcal{T}(x))$ in the substituted objective receives $k$ gradient updates, yielding $\theta^{t+k}$; this mimics the $\theta$-subproblem. Then $\eta$ is optimized: $\mathcal{F}_\theta(\mathcal{T}(x))$ is treated as a constant while $\mathcal{F}_{\theta^{t+k}}(\mathcal{T}'(x))$ is updated, mimicking the $\eta$-subproblem. As can be seen, the optimization results are very good, supporting the author's hypothesis.
The discussion above deliberately omits the prediction MLP. Because the single-sample estimate is only a rough approximation of the expectation, the author hypothesizes that the prediction MLP compensates for the approximation error. This is verified experimentally, but not recorded here.
The algorithm's pseudocode is as follows.
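The pseudocode figure did not survive extraction. The paper's Algorithm 1 is itself PyTorch-style pseudocode; the sketch below is a paraphrase from memory, with `f` (encoder plus projection MLP), `h` (prediction MLP), `aug`, `normalize`, and `update` as placeholders rather than real APIs:

```python
# f: backbone encoder followed by the projection MLP
# h: the prediction MLP
for x in loader:                       # load a minibatch
    x1, x2 = aug(x), aug(x)            # two random augmented views
    z1, z2 = f(x1), f(x2)              # projections
    p1, p2 = h(z1), h(z2)              # predictions
    L = D(p1, z2) / 2 + D(p2, z1) / 2  # symmetrized loss
    L.backward()                       # gradients flow only through p1, p2
    update(f, h)                       # SGD step

def D(p, z):                           # negative cosine similarity
    z = z.detach()                     # stop-gradient
    p = normalize(p, dim=1)            # L2-normalize along the feature dim
    z = normalize(z, dim=1)
    return -(p * z).sum(dim=1).mean()
```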
Experiments
The experiments verifying that SimSiam avoids collapse are not recorded here; only results that are helpful in practice are.
Because SimSiam is a contrastive learning algorithm that uses no negative pairs, it is insensitive to batch size, as shown below.
In addition, the author demonstrates the role of the prediction MLP, as shown below; clearly the prediction MLP has a huge impact on SimSiam.
The author also explores the effect of adding BN to the output layers of the prediction MLP and projection MLP, as shown below. The BN layers likewise have a significant impact on SimSiam. Evidently, contrastive learning is extremely sensitive to such details.