Deep Learning (Self-Supervision: SimSiam) -- Exploring Simple Siamese Representation Learning
2022-07-28 06:09:00 【Food to doubt life】
Preface
This paper, from Kaiming He's group, was published at CVPR 2021 and was nominated for the best paper award. It tackles the collapse problem in self-supervised contrastive learning: collapse means that no matter what the input is, the feature extractor outputs the same feature vector.
This post gives a brief introduction to SimSiam and records its most interesting experimental results.
The authors do not definitively explain why SimSiam avoids collapse, but the paper is brilliant nonetheless.
SimSiam overview

The figure above shows the overall structure of SimSiam. Concretely:
- Apply data augmentation to the input image $x$ to obtain two views $x_1$ and $x_2$.
- Feed $x_1$ and $x_2$ into the same feature extractor, followed by a projection MLP, to obtain $z_1$ and $z_2$.
- Pass $z_1$ through a prediction MLP to obtain $p_1$ (and, symmetrically, $z_2$ to obtain $p_2$).
The contrastive loss is the negative cosine similarity

$$\mathcal{D}(p_1, z_2) = -\frac{p_1}{\|p_1\|_2} \cdot \frac{z_2}{\|z_2\|_2}$$

symmetrized over the two views:

$$\mathcal{L} = \frac{1}{2}\mathcal{D}(p_1, \mathrm{stopgrad}(z_2)) + \frac{1}{2}\mathcal{D}(p_2, \mathrm{stopgrad}(z_1))$$
During backpropagation, $\frac{z_2}{\|z_2\|_2}$ is treated as a constant (the stop-gradient), and only $\frac{p_1}{\|p_1\|_2}$ receives gradients. Note that the collapsed solution still lies in the solution space.
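The loss and its stop-gradient can be sketched in a few lines. The sketch below uses NumPy purely for illustration (NumPy has no autograd, so the stop-gradient is only indicated in a comment; in a real framework it would be `z.detach()`, and all function names here are mine):

```python
import numpy as np

def neg_cosine(p, z):
    """D(p, z): negative cosine similarity, averaged over the batch.

    z is the stop-gradient branch: it is treated as a constant during
    backpropagation. NumPy has no autograd, so this is only indicated
    here; in PyTorch it would be z = z.detach().
    """
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -(p * z).sum(axis=1).mean()

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized loss: L = D(p1, z2)/2 + D(p2, z1)/2."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
# If predictions equal projections exactly, each D term is -1, so L = -1
# (the minimum of the loss).
print(simsiam_loss(z, z, z, z))
```

Since each term is a cosine similarity, the loss is bounded in $[-1, 1]$, and collapse (identical outputs for every input) attains the minimum, which is why some extra mechanism is needed to avoid it.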
To explain the optimization above, the author hypothesizes a loss function of the form

$$\mathcal{L}(\theta, \eta) = \mathbb{E}_{x,\mathcal{T}}\Big[\big\|\mathcal{F}_\theta(\mathcal{T}(x)) - \eta_x\big\|_2^2\Big]$$

where $\mathcal{F}_\theta(x)$ is the neural network, $\mathcal{T}(x)$ denotes a data augmentation applied to $x$, and $\eta_x$ is an extra parameter to be estimated for each image. The parameters to estimate are therefore $\theta$ and $\eta$, and minimizing the loss proceeds much like coordinate descent:

$$\theta^t = \arg\min_\theta \mathcal{L}(\theta, \eta^{t-1}), \qquad \eta^t = \arg\min_\eta \mathcal{L}(\theta^t, \eta)$$
Here $\eta^{t-1}$ denotes the value of $\eta$ after $t-1$ optimization steps, and likewise for $\theta^t$. First $\eta^{t-1}$ is held constant and $\theta^t$ is solved for, so that $\mathcal{L}(\theta^t, \eta^{t-1})$ is minimal over all values of $\theta$; then $\eta^t$ is obtained the same way. This is exactly the coordinate descent method. A closed form for $\eta^t$ follows from setting the derivative to zero:
$$\frac{\partial \mathcal{L}(\theta,\eta)}{\partial \eta} = -\mathbb{E}_{\mathcal{T}}\big[2\big(\mathcal{F}_{\theta^t}(\mathcal{T}(x)) - \eta_x\big)\big] = 0$$
Solving for $\eta_x$ gives

$$\eta_x^t = \mathbb{E}_{\mathcal{T}}\big[\mathcal{F}_{\theta^t}(\mathcal{T}(x))\big]$$
Via a Monte Carlo approximation, this expectation can be estimated with a single augmented sample:

$$\eta_x^t \approx \mathcal{F}_{\theta^t}(\mathcal{T}'(x))$$
$\mathcal{T}'(x)$ denotes another augmentation applied to $x$, drawn from the same distribution as $\mathcal{T}(x)$; writing it this way keeps the later expressions clean. Substituting this estimate back into the $\theta$-subproblem gives

$$\theta^{t+1} = \arg\min_\theta \mathbb{E}_{x,\mathcal{T}}\Big[\big\|\mathcal{F}_\theta(\mathcal{T}(x)) - \mathcal{F}_{\theta^t}(\mathcal{T}'(x))\big\|_2^2\Big]$$
This objective can be read as follows: apply two data augmentations to an image $x$ to obtain $\mathcal{T}(x)$ and $\mathcal{T}'(x)$, run both through the network, and minimize their L2 distance in feature space; during backpropagation, $\mathcal{F}_{\theta^t}(\mathcal{T}'(x))$ is a constant. Once $\mathcal{F}_{\theta^t}(\mathcal{T}'(x))$ and $\mathcal{F}_{\theta}(\mathcal{T}(x))$ are L2-normalized, this objective is equivalent to the SimSiam loss.
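To make the alternation concrete, here is a toy numeric sketch; everything in it is invented for illustration (a scalar "network" $F_\theta(x) = \theta x$, additive-noise augmentation, a made-up step size). The $\eta$-step averages the network output over augmentations, and the $\theta$-step takes a gradient step with $\eta$ held constant; together they drive the hypothesized loss down:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=10)          # the "dataset": 10 scalar images
theta = 1.5                       # network parameter: F_theta(x) = theta * x
eta = np.zeros_like(xs)           # one latent eta_x per image

def aug(x, n):
    """n random augmentations of x (additive noise stands in for T)."""
    return x + 0.1 * rng.normal(size=n)

def loss(theta, eta):
    """Monte Carlo estimate of E_{x,T}[(F_theta(T(x)) - eta_x)^2]."""
    t = np.stack([aug(x, 32) for x in xs])           # (10, 32) views
    return ((theta * t - eta[:, None]) ** 2).mean()

history = [loss(theta, eta)]
for _ in range(20):
    # eta-step: eta_x = E_T[F_theta(T(x))], estimated with 32 samples
    eta = np.array([(theta * aug(x, 32)).mean() for x in xs])
    # theta-step: one gradient step on the loss with eta held constant
    t = np.stack([aug(x, 32) for x in xs])
    grad = (2 * (theta * t - eta[:, None]) * t).mean()
    theta -= 0.2 * grad
    history.append(loss(theta, eta))

print(history[0], history[-1])    # the alternation reduces the loss
```

In this toy the $\eta$-step does most of the work; note also that nothing here forbids the collapsed solution $\theta = 0$, $\eta = 0$, matching the earlier remark that collapse remains in the solution space.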
SimSiam can therefore be viewed as an optimization problem over two sets of variables to be estimated, solved alternately. To test this hypothesis, the author ran a set of experiments, shown below.
In the k-step variant, $\mathcal{F}_{\theta^t}(\mathcal{T}'(x))$ is cached and treated as a constant while $\mathcal{F}_\theta(\mathcal{T}(x))$ in the substituted objective receives $k$ gradient updates, yielding $\theta^{t+k}$; this mimics the $\theta$-subproblem. Then $\eta$ is optimized: $\mathcal{F}_\theta(\mathcal{T}(x))$ is treated as a constant while $\mathcal{F}_{\theta^{t+k}}(\mathcal{T}'(x))$ is updated, mimicking the $\eta$-subproblem. As can be seen, the optimization results are very good, supporting the author's hypothesis.
The discussion above deliberately omits the prediction MLP. Because the single-sample estimate is only a rough approximation of the expectation, the author hypothesizes that the prediction MLP compensates for the approximation error. This is verified experimentally, but not recorded here.
The algorithm's pseudocode is as follows.
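The pseudocode figure did not survive extraction. The paper's Algorithm 1 is itself PyTorch-style pseudocode; the sketch below is a paraphrase from memory, with `f` (encoder plus projection MLP), `h` (prediction MLP), `aug`, `normalize`, and `update` as placeholders rather than real APIs:

```python
# f: backbone encoder followed by the projection MLP
# h: the prediction MLP
for x in loader:                       # load a minibatch
    x1, x2 = aug(x), aug(x)            # two random augmented views
    z1, z2 = f(x1), f(x2)              # projections
    p1, p2 = h(z1), h(z2)              # predictions
    L = D(p1, z2) / 2 + D(p2, z1) / 2  # symmetrized loss
    L.backward()                       # gradients flow only through p1, p2
    update(f, h)                       # SGD step

def D(p, z):                           # negative cosine similarity
    z = z.detach()                     # stop-gradient
    p = normalize(p, dim=1)            # L2-normalize along the feature dim
    z = normalize(z, dim=1)
    return -(p * z).sum(dim=1).mean()
```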
Experiments
The experiments verifying that SimSiam avoids collapse are not recorded here; only results that are helpful in practice are.
Because SimSiam is a contrastive learning algorithm that uses no negative pairs, it is insensitive to batch size, as shown below.
In addition, the author demonstrates the role of the prediction MLP, as shown below; clearly the prediction MLP has a huge impact on SimSiam.
The author also explores the effect of adding BN to the output layers of the prediction MLP and projection MLP, as shown below. The BN layers likewise have a significant impact on SimSiam. Evidently, contrastive learning is extremely sensitive to such details.