Deep learning parameter initialization (I) Xavier initialization with code
2022-07-03 07:04:00 【Xiaoshu Xiaoshu】
Contents

I. Brief introduction
II. Basic knowledge
III. Standard initialization method
IV. Xavier initialization assumptions
V. Simple derivation of Xavier initialization
VI. PyTorch implementation
VII. Comparative experiments
    1. Histogram of activation values in each layer
    2. Distribution of back-propagated (state) gradients in each layer
    3. Distribution of parameter gradients in each layer
    4. Distribution of the variance of weight gradients in each layer
VIII. Summary
I. Brief introduction
Xavier initialization is also called Glorot initialization, after its inventor Xavier Glorot. It is the initialization method proposed by Glorot et al. to address the problems of naive random initialization: the idea is to make the input and output of each layer follow roughly the same distribution, so that the activation values in later layers do not collapse toward 0.
Since weights are usually initialized from either a Gaussian or a uniform distribution, and the two behave almost identically as long as their variances match, the Gaussian and uniform cases are treated together below.
PyTorch already provides an implementation, introduced in detail below:
torch.nn.init.xavier_uniform_(tensor: Tensor, gain: float = 1.)
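As a minimal usage sketch (the 256-in / 128-out layer shape is an arbitrary choice for illustration):

import torch

# A hypothetical linear layer; the shape is chosen only for illustration.
linear = torch.nn.Linear(256, 128)

# Fill the weight tensor in place with Xavier/Glorot uniform values.
torch.nn.init.xavier_uniform_(linear.weight, gain=1.)

# All sampled values should lie within +/- sqrt(6 / (fan_in + fan_out)).
bound = (6.0 / (256 + 128)) ** 0.5
print(linear.weight.abs().max().item() <= bound)  # expected: True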
II. Basic knowledge
1. Variance of the uniform distribution: for $X \sim U(a, b)$,
$$\mathrm{Var}(X) = \frac{(b-a)^2}{12}.$$
2. Suppose that the random variables X and Y are independent of each other. Then
$$\mathrm{Var}(XY) = E(X)^2\,\mathrm{Var}(Y) + E(Y)^2\,\mathrm{Var}(X) + \mathrm{Var}(X)\,\mathrm{Var}(Y).$$
3. Suppose that the random variables X and Y are independent of each other and E(X) = E(Y) = 0. Then
$$\mathrm{Var}(XY) = \mathrm{Var}(X)\,\mathrm{Var}(Y).$$
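A quick numeric check of point 3 (a sketch; the scales and sample size are arbitrary):

import torch

torch.manual_seed(0)

# Two independent, zero-mean random variables with known variances.
x = torch.randn(1_000_000) * 2.0   # Var(X) = 4
y = torch.randn(1_000_000) * 3.0   # Var(Y) = 9

# For independent zero-mean X and Y, Var(XY) should be close to Var(X) * Var(Y) = 36.
print((x * y).var().item())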
III. Standard initialization method
When the weights are initialized from the uniform distribution
$$W_{i,j} \sim U\!\left(-\frac{1}{\sqrt{n}},\ \frac{1}{\sqrt{n}}\right),$$
formula (1) gives the variance
$$\mathrm{Var}(W) = \frac{(2/\sqrt{n})^2}{12} = \frac{1}{3n},$$
so the corresponding Gaussian form is
$$W_{i,j} \sim N\!\left(0,\ \frac{1}{3n}\right),$$
where $n$ is the number of inputs to the layer.
For a fully connected network, treat each dimension x of the input X as a random variable, and assume E(x) = 0 and Var(x) = 1. Assuming further that the weights W and the input X are independent of each other, the variance of a hidden-layer state s is
$$\mathrm{Var}(s) = n\,\mathrm{Var}(w)\,\mathrm{Var}(x) = n \cdot \frac{1}{3n} \cdot 1 = \frac{1}{3}.$$
The standard initialization method therefore has an appealing property: the hidden state has mean 0 and a constant variance of 1/3, independent of the number of layers. This means that for a function such as sigmoid the pre-activation falls in the region where the gradient is significant.
However, sigmoid activations are always greater than 0, so the input to the next layer no longer satisfies E(x) = 0. In practice, standard initialization is only suitable for activation functions that satisfy the Glorot assumptions below, such as tanh.
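A minimal numeric sketch of this constant-variance property (the width 512 and the sample count are arbitrary choices):

import torch

torch.manual_seed(0)

n = 512                      # arbitrary layer width, for illustration only
x = torch.randn(10000, n)    # inputs with E(x) = 0, Var(x) = 1

# Standard initialization: W ~ U(-1/sqrt(n), 1/sqrt(n)), so Var(W) = 1/(3n).
W = torch.empty(n, n).uniform_(-1.0 / n ** 0.5, 1.0 / n ** 0.5)

s = x @ W                    # hidden-layer pre-activation
print(s.var().item())        # expected to be close to 1/3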
IV. Xavier initialization assumptions
At the beginning of the article we gave the necessary conditions for parameter initialization. Those conditions only guarantee that something useful can be learned during training, namely that the parameter gradients are not 0 (because the pre-activations are kept in the effective region of the activation function). Glorot argues that a good initialization should additionally keep the variances of the activation values and of the state gradients consistent across layers during propagation. That is, the forward-propagated activations and the back-propagated state gradients should have the same variance in every layer:
$$\forall (i, i'):\quad \mathrm{Var}\!\left(z^{i}\right) = \mathrm{Var}\!\left(z^{i'}\right), \qquad \mathrm{Var}\!\left(\frac{\partial\,\mathrm{Cost}}{\partial s^{i}}\right) = \mathrm{Var}\!\left(\frac{\partial\,\mathrm{Cost}}{\partial s^{i'}}\right),$$
where $z^{i}$ denotes the activations and $s^{i}$ the pre-activation states of layer $i$.
We call these two conditions the Glorot conditions.
Combining these, we now make the following assumptions:
1. Every input feature has the same variance: Var(x);
2. The activation function is symmetric (odd), so the input mean of every layer can be assumed to be 0;
3. f′(0) = 1;
4. At initialization, the state values fall in the linear region of the activation function: $f'(s_i^{(k)}) \approx 1$.
The last three are assumptions about the activation function, which we call the Glorot activation-function assumptions; tanh satisfies them, as the quick check below illustrates.
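As a sanity-check sketch (not part of the original derivation), the assumptions can be verified numerically for tanh:

import torch

x = torch.tensor(0.0, requires_grad=True)
torch.tanh(x).backward()

print(torch.tanh(torch.tensor(0.0)).item())   # 0.0: the output is centered at 0
print(x.grad.item())                          # 1.0: f'(0) = 1
print(torch.allclose(torch.tanh(torch.tensor(-2.0)),
                     -torch.tanh(torch.tensor(2.0))))  # True: tanh is an odd (symmetric) function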
V. Simple derivation of Xavier initialization
First, write down the expressions for the state gradient and the parameter gradient (in the notation of the Glorot paper, where $z^{i}$ are the activations and $s^{i}$ the states of layer $i$):
$$\frac{\partial\,\mathrm{Cost}}{\partial s_k^{i}} = f'(s_k^{i})\, W_{k,\bullet}^{\,i+1}\, \frac{\partial\,\mathrm{Cost}}{\partial s^{i+1}}, \qquad \frac{\partial\,\mathrm{Cost}}{\partial w_{l,k}^{i}} = z_l^{i}\, \frac{\partial\,\mathrm{Cost}}{\partial s_k^{i}}.$$
Taking a fully connected layer as an example, a single pre-activation is
$$s = \sum_{j=1}^{n_i} w_j x_j + b,$$
where $n_i$ is the number of inputs.
From probability theory we have the variance formula (point 2 above)
$$\mathrm{Var}(w_j x_j) = E(w_j)^2\,\mathrm{Var}(x_j) + E(x_j)^2\,\mathrm{Var}(w_j) + \mathrm{Var}(w_j)\,\mathrm{Var}(x_j).$$
In particular, when the input and the weight are both assumed to have mean 0 (with BN this is now easier to satisfy), the formula simplifies to
$$\mathrm{Var}(w_j x_j) = \mathrm{Var}(w_j)\,\mathrm{Var}(x_j).$$
Assuming further that the inputs x and the weights w are independent and identically distributed, the output variance is
$$\mathrm{Var}(s) = n_i\,\mathrm{Var}(w)\,\mathrm{Var}(x),$$
so keeping the input and output variances equal requires
$$n_i\,\mathrm{Var}(w) = 1, \quad \text{i.e.} \quad \mathrm{Var}(w) = \frac{1}{n_i}.$$
For a multi-layer network, the variance of layer $i$ can be written as a product over the preceding layers, where $i$ is the current layer index:
$$\mathrm{Var}\!\left(z^{i}\right) = \mathrm{Var}(x) \prod_{i'=0}^{i-1} n_{i'}\,\mathrm{Var}\!\left(W^{i'}\right).$$
In particular, the back-propagated gradient has a similar form: for a network of depth $d$,
$$\mathrm{Var}\!\left(\frac{\partial\,\mathrm{Cost}}{\partial s^{i}}\right) = \mathrm{Var}\!\left(\frac{\partial\,\mathrm{Cost}}{\partial s^{d}}\right) \prod_{i'=i}^{d-1} n_{i'+1}\,\mathrm{Var}\!\left(W^{i'}\right).$$
To sum up, keeping the variance of every layer consistent in both forward and backward propagation requires
$$\forall i:\quad n_i\,\mathrm{Var}\!\left(W^{i}\right) = 1 \quad \text{and} \quad n_{i+1}\,\mathrm{Var}\!\left(W^{i}\right) = 1.$$
However, in practice the numbers of inputs and outputs of a layer are usually not equal, so as a compromise we average the two constraints, and the weight variance should finally satisfy
$$\mathrm{Var}\!\left(W^{i}\right) = \frac{2}{n_i + n_{i+1}}.$$
Therefore the Gaussian form of Xavier initialization is
$$W \sim N\!\left(0,\ \frac{2}{n_i + n_{i+1}}\right).$$
From the variance formula of the uniform distribution, for $U(-a, a)$ we have $\mathrm{Var} = \frac{(2a)^2}{12} = \frac{a^2}{3}$. Setting $\frac{a^2}{3} = \frac{2}{n_i + n_{i+1}}$ gives $a = \sqrt{\frac{6}{n_i + n_{i+1}}}$, so Xavier initialization is implemented as the uniform distribution
$$W \sim U\!\left(-\sqrt{\frac{6}{n_i + n_{i+1}}},\ \sqrt{\frac{6}{n_i + n_{i+1}}}\right).$$
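A minimal numeric check of this result (fan_in = 256 and fan_out = 128 are arbitrary illustration values): sampling from the uniform bound above should reproduce the target variance 2 / (fan_in + fan_out).

import torch

torch.manual_seed(0)

fan_in, fan_out = 256, 128   # arbitrary illustration values
target_var = 2.0 / (fan_in + fan_out)

# Xavier uniform bound: a = sqrt(6 / (fan_in + fan_out)), and Var(U(-a, a)) = a^2 / 3.
a = (6.0 / (fan_in + fan_out)) ** 0.5
W = torch.empty(fan_out, fan_in).uniform_(-a, a)

print(target_var, W.var().item())  # the two numbers should be close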
VI. PyTorch implementation
import torch


# A small demo network with a single convolution layer, used to show weight initialization.
class DemoNet(torch.nn.Module):
    def __init__(self):
        super(DemoNet, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 1, 3)
        print('random init:', self.conv1.weight)
        '''
        Xavier uniform initialization samples from U(-a, a) with
        a = gain * sqrt(6 / (fan_in + fan_out)).
        The gain is set according to the activation function; this method
        is also known as Glorot initialization.
        '''
        torch.nn.init.xavier_uniform_(self.conv1.weight, gain=1.)
        print('xavier_uniform_:', self.conv1.weight)
        '''
        Xavier normal initialization samples from a normal distribution with
        mean = 0 and std = gain * sqrt(2 / (fan_in + fan_out)).
        '''
        torch.nn.init.xavier_normal_(self.conv1.weight, gain=1.)
        print('xavier_normal_:', self.conv1.weight)


if __name__ == '__main__':
    demoNet = DemoNet()
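In practice, initialization is usually applied to every layer of a model at once. The helper below is a hypothetical sketch of that pattern (the layer shapes are arbitrary), not part of the original post:

import torch


def init_weights_xavier(module):
    # Hypothetical helper: apply Xavier uniform initialization to every Conv2d/Linear weight.
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        torch.nn.init.xavier_uniform_(module.weight, gain=1.)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)


model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3),
    torch.nn.Flatten(),
    torch.nn.Linear(8 * 26 * 26, 10),
)
model.apply(init_weights_xavier)  # .apply() visits every submodule recursively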
VII. Comparative experiments
The experiments use tanh as the activation function.
1. Histogram of activation values in each layer
[Figure: per-layer histograms of activation values; standard initialization (top) vs Xavier initialization (bottom)]
The top panel shows standard initialization and the bottom panel shows Xavier initialization. With Xavier initialization the activation values of the different layers are much more consistent, and their magnitudes are smaller than under the original standard initialization.
2. Distribution of back-propagated gradients (gradients with respect to the states) in each layer
[Figure: per-layer histograms of back-propagated state gradients; standard initialization (top) vs Xavier initialization (bottom)]
The top panel shows standard initialization and the bottom panel shows Xavier initialization. With Xavier initialization the back-propagated gradients of the different layers are much more consistent, and their magnitudes are smaller than under standard initialization. The author suspects that gradients of very different scales across layers may lead to ill-conditioning or slow training.
3. Distribution of parameter gradients in each layer
[Figure: per-layer histograms of parameter (weight) gradients; standard initialization (top) vs Xavier initialization (bottom)]
Formula (3) already shows that the variance of the parameter gradient is essentially independent of the layer index. The top panel shows standard initialization and the bottom panel shows Xavier initialization. Compared with Xavier initialization, the parameter gradients under standard initialization are about one order of magnitude smaller.
4. Distribution of the variance of weight gradients in each layer
[Figure: per-layer variance of weight gradients; standard initialization (top) vs Xavier initialization (bottom)]
The top panel shows standard initialization and the bottom panel shows Xavier initialization. With Xavier initialization the variance of the weight gradients is consistent across layers.
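A minimal sketch along these lines (not the paper's exact setup; the depth, width, and synthetic input are arbitrary choices) that prints the per-layer activation standard deviation of a tanh MLP under the two initializations:

import torch


def layer_activation_stds(init_fn, depth=5, width=256, n_samples=2048):
    # Forward random data through a tanh MLP and record each layer's activation std.
    torch.manual_seed(0)
    x = torch.randn(n_samples, width)
    stds = []
    for _ in range(depth):
        layer = torch.nn.Linear(width, width)
        init_fn(layer.weight)
        torch.nn.init.zeros_(layer.bias)
        x = torch.tanh(layer(x))
        stds.append(round(x.std().item(), 4))
    return stds


# Standard initialization: U(-1/sqrt(n), 1/sqrt(n)); Xavier: the uniform formula above.
standard = lambda w: torch.nn.init.uniform_(w, -1.0 / w.shape[1] ** 0.5, 1.0 / w.shape[1] ** 0.5)
xavier = lambda w: torch.nn.init.xavier_uniform_(w, gain=1.)

print('standard:', layer_activation_stds(standard))
print('xavier:  ', layer_activation_stds(xavier))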
VIII. Summary
1. Gaussian form of Xavier initialization:
$$W \sim N\!\left(0,\ \frac{2}{n_i + n_{i+1}}\right)$$
2. Uniform form of Xavier initialization:
$$W \sim U\!\left(-\sqrt{\frac{6}{n_i + n_{i+1}}},\ \sqrt{\frac{6}{n_i + n_{i+1}}}\right)$$
3. Xavier initialization builds on the standard initialization method by taking the per-layer variance of both forward propagation and back propagation into account.
4. Shortcomings of Xavier initialization: the derivation relies on several assumptions. One is that the activation function behaves linearly (around 0), which does not hold for ReLU. Another is that the activation values are symmetric about 0, which does not hold for sigmoid or ReLU. When sigmoid or ReLU is used, the activations and parameter gradients obtained with Xavier initialization behave much like those of standard initialization: the variance of the activations decreases layer by layer, and so does the variance of the parameter gradients.
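A short sketch (arbitrary depth and width, not from the original experiments) that illustrates the last point: with ReLU, the activation variance shrinks layer by layer even under Xavier initialization.

import torch

torch.manual_seed(0)

width, depth = 256, 10       # arbitrary illustration choices
x = torch.randn(2048, width)

for i in range(depth):
    layer = torch.nn.Linear(width, width)
    torch.nn.init.xavier_uniform_(layer.weight, gain=1.)
    torch.nn.init.zeros_(layer.bias)
    x = torch.relu(layer(x))
    print(i, x.var().item())  # the variance decreases layer by layer with ReLU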