Deep learning parameter initialization (I) Xavier initialization with code
2022-07-03 07:04:00 【Xiaoshu Xiaoshu】
Catalog
III. Standard initialization method
IV. Xavier initialization assumptions
V. Simple derivation of Xavier initialization
VI. PyTorch implementation
VII. Comparative experiments
1. Histogram of activation values in each layer
2. Distribution of back-propagated (state) gradients in each layer
3. Distribution of parameter gradients in each layer
4. Distribution of the variance of weight gradients in each layer
I. Introduction
Xavier initialization is also called Glorot initialization, after its inventor Xavier Glorot. It is an initialization method proposed by Glorot et al. to address the problems of plain random initialization. The idea is to make the input and output of each layer follow approximately the same distribution, so that the activation values in deeper layers do not collapse toward 0.
Weights are usually initialized from either a Gaussian or a uniform distribution, and the two behave almost identically as long as their variances match, so the Gaussian and uniform cases are discussed together.
PyTorch already provides an implementation; it is introduced in detail below:
```python
torch.nn.init.xavier_uniform_(tensor: Tensor, gain: float = 1.)
torch.nn.init.xavier_normal_(tensor: Tensor, gain: float = 1.)
```
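For a quick feel of the two functions, here is a minimal usage sketch (the layer sizes are arbitrary, not taken from the post): both variants are applied to a freshly created Linear layer and the resulting weight statistics are inspected.

```python
import torch

# Minimal usage sketch (layer sizes are arbitrary): nn.Linear(256, 128) has
# fan_in = 256 and fan_out = 128, so the Xavier std is sqrt(2/(256+128)) ≈ 0.072.
linear = torch.nn.Linear(256, 128)

torch.nn.init.xavier_uniform_(linear.weight, gain=1.0)
print(linear.weight.std())   # ≈ 0.072, values drawn from a uniform distribution

torch.nn.init.xavier_normal_(linear.weight, gain=1.0)
print(linear.weight.std())   # ≈ 0.072 as well, values drawn from a Gaussian
```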
II. Basic knowledge
1. Variance of the uniform distribution: for X ~ U(a, b), Var(X) = (b − a)² / 12.
2. Suppose random variables X and Y are independent of each other. Then Var(XY) = Var(X)Var(Y) + E(X)² Var(Y) + E(Y)² Var(X).
3. Suppose random variables X and Y are independent of each other and E(X) = E(Y) = 0. Then Var(XY) = Var(X)Var(Y).
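These identities can be checked numerically. Below is a quick Monte Carlo sanity check (an illustrative sketch, not part of the original post) for the zero-mean case.

```python
import torch

# Monte Carlo check of Var(XY) = Var(X) * Var(Y) for independent, zero-mean X and Y.
torch.manual_seed(0)
x = torch.randn(1_000_000) * 2.0        # X ~ N(0, 4):      E(X) = 0, Var(X) = 4
y = torch.rand(1_000_000) - 0.5         # Y ~ U(-0.5, 0.5): E(Y) = 0, Var(Y) = 1/12
print((x * y).var())                    # ≈ 4 * (1/12) ≈ 0.333
print(x.var() * y.var())                # ≈ 0.333
```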
III. Standard initialization method
Standard initialization draws the weights from the uniform distribution

    W ~ U(−1/√n, 1/√n)        (1)

where n is the number of inputs of the layer. Since the variance of formula (1) is (2/√n)² / 12 = 1/(3n), the corresponding Gaussian version is written as W ~ N(0, 1/(3n)).
For a fully connected network, treat each dimension x of the input X as a random variable, and assume E(x) = 0 and Var(x) = 1. Assuming the weights W and the input X are mutually independent, the variance of a hidden-layer state is

    Var(s) = n · Var(w) · Var(x) = n · 1/(3n) · 1 = 1/3

So the standard initialization method has a very nice property: the hidden-layer states have mean 0 and a constant variance of 1/3, independent of the number of layers. For a function such as sigmoid this means the argument falls in the region where the gradient is significant.
However, because sigmoid activations are always greater than 0, the input of the next layer no longer satisfies E(x) = 0. In fact, standard initialization is only suitable for activation functions that satisfy the Glorot assumptions given below, such as tanh.
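The constant-1/3 variance of the hidden state can be verified with a small simulation. This is an illustrative sketch under the assumptions above (zero-mean, unit-variance inputs; weights from U(−1/√n, 1/√n)); the layer width is arbitrary.

```python
import torch

# Standard initialization: W ~ U(-1/sqrt(n), 1/sqrt(n)), so Var(w) = 1/(3n).
# With E(x) = 0 and Var(x) = 1, the pre-activation variance should be ≈ 1/3.
torch.manual_seed(0)
n = 512                                    # number of inputs per unit (arbitrary)
x = torch.randn(10_000, n)                 # inputs with E(x) = 0, Var(x) = 1
w = (torch.rand(n, n) * 2 - 1) / n ** 0.5  # uniform in (-1/sqrt(n), 1/sqrt(n))
s = x @ w                                  # hidden states before the activation
print(s.var())                             # ≈ 1/3
```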
IV. Xavier initialization assumptions
The necessary conditions for parameter initialization were given at the beginning of the article, but those two conditions only ensure that useful information can be learned during training, i.e. that the parameter gradients are not 0 (because the parameters are kept in the effective region of the activation function). Glorot goes further and argues that a good initialization should keep the variances of the activation values and of the state gradients consistent across layers during propagation. In other words, the variance should be the same across layers in both forward and back propagation:

    Var(z^(i)) = Var(z^(i′))  and  Var(∂Cost/∂s^(i)) = Var(∂Cost/∂s^(i′))  for all layers i, i′

We call these two conditions the Glorot conditions.
Combining these, we now make the following assumptions:
1. Every input feature has the same variance: Var(x);
2. The activation function is symmetric about the origin, so the input mean of each layer can be assumed to be 0;
3. f′(0) = 1;
4. At initialization, the state values fall in the linear region of the activation function: f′(s_i^(k)) ≈ 1.
The last three are assumptions about the activation function; we call them the Glorot activation function assumptions.
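tanh satisfies these assumptions, which can be checked directly (a tiny illustrative snippet, not from the original post): it is an odd function, so activations stay zero-centered, and its derivative at 0 is exactly 1.

```python
import torch

# Check the Glorot activation assumptions for tanh:
# symmetry about the origin (odd function) and f'(0) = 1.
x = torch.tensor(0.0, requires_grad=True)
torch.tanh(x).backward()
print(x.grad)                             # tensor(1.)  ->  f'(0) = 1
print(torch.tanh(torch.tensor(0.5)))      # 0.4621...
print(-torch.tanh(torch.tensor(-0.5)))    # 0.4621...  ->  tanh(-x) = -tanh(x)
```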
V. Simple derivation of Xavier initialization
First, the expressions for the gradient with respect to the states and the gradient with respect to the parameters are given. Writing s^(i) for the states (pre-activations) and z^(i) for the activations of layer i, with s^(i) = z^(i) W^(i) + b^(i) and z^(i+1) = f(s^(i)):

    ∂Cost/∂s_k^(i) = f′(s_k^(i)) · W_{k,·}^(i+1) · ∂Cost/∂s^(i+1)
    ∂Cost/∂w_{l,k}^(i) = z_l^(i) · ∂Cost/∂s_k^(i)
Take a single fully connected layer as an example. Its expression is

    y = Σ_{i=1}^{n_i} w_i x_i + b

where n_i denotes the number of inputs.
From probability theory we have the following variance formula:

    Var(w_i x_i) = E(w_i)² Var(x_i) + E(x_i)² Var(w_i) + Var(w_i) Var(x_i)

In particular, when the inputs and the weights both have zero mean (with BN this is now easy to satisfy), the formula simplifies to:

    Var(w_i x_i) = Var(w_i) Var(x_i)

Further assume that the inputs x and the weights w are each independent and identically distributed. Then

    Var(y) = n_i Var(w_i) Var(x_i)

and keeping the input and output variances equal requires

    n_i Var(w_i) = 1,  i.e.  Var(w_i) = 1 / n_i
For a multi-layer network, the variance of a given layer can be written as a product over the layers below it, where i is the index of the current layer:

    Var(z^(i)) = Var(x) · Π_{k=0}^{i−1} n_k Var(W^(k))

Back propagation of the gradient has a similar form (d denotes the last layer):

    Var(∂Cost/∂s^(i)) = Var(∂Cost/∂s^(d)) · Π_{k=i}^{d} n_{k+1} Var(W^(k))

To sum up, keeping the variance of every layer consistent in both forward and back propagation requires

    ∀i:  n_i Var(W^(i)) = 1  and  n_{i+1} Var(W^(i)) = 1

However, in practice the numbers of inputs and outputs of a layer are usually not equal, so as a compromise we take the average of the two conditions; the weight variance should finally satisfy:

    Var(W) = 2 / (n_i + n_{i+1})
Therefore the Gaussian form of Xavier initialization is:

    W ~ N(0, 2 / (n_in + n_out))

Using the variance formula of the uniform distribution, Var = (b − a)² / 12, and the fact that here |a| = |b|, setting (2a)² / 12 = a²/3 = 2 / (n_in + n_out) gives a = √6 / √(n_in + n_out), so Xavier initialization is implemented as the uniform distribution:

    W ~ U(−√6 / √(n_in + n_out), √6 / √(n_in + n_out))
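The two formulas can be implemented directly from the derivation. The following sketch (with an arbitrary fan_in/fan_out pair) builds both versions by hand and checks that the built-in torch.nn.init functions produce the same weight statistics.

```python
import torch

fan_in, fan_out = 300, 100      # arbitrary example sizes

# Gaussian version: W ~ N(0, 2 / (fan_in + fan_out))
std = (2.0 / (fan_in + fan_out)) ** 0.5
w_normal = torch.randn(fan_out, fan_in) * std

# Uniform version: W ~ U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
a = (6.0 / (fan_in + fan_out)) ** 0.5
w_uniform = (torch.rand(fan_out, fan_in) * 2 - 1) * a

print(w_normal.std(), w_uniform.std())          # both ≈ sqrt(2/400) ≈ 0.0707

# The built-in initializer should match these statistics.
w_ref = torch.empty(fan_out, fan_in)
torch.nn.init.xavier_uniform_(w_ref)
print(w_ref.std())                              # ≈ 0.0707
```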
VI. PyTorch implementation
```python
import torch


# Define a demo model with a single convolution layer whose weight is
# initialized by hand.
class DemoNet(torch.nn.Module):
    def __init__(self):
        super(DemoNet, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 1, 3)
        print('random init:', self.conv1.weight)
        '''
        xavier_uniform_ draws from the uniform distribution U(-a, a) with
        a = gain * sqrt(6 / (fan_in + fan_out)). The gain is chosen according
        to the type of activation function. This initialization method is
        also known as Glorot initialization.
        '''
        torch.nn.init.xavier_uniform_(self.conv1.weight, gain=1.)
        print('xavier_uniform_:', self.conv1.weight)
        '''
        xavier_normal_ draws from a normal distribution with
        mean = 0 and std = gain * sqrt(2 / (fan_in + fan_out)).
        '''
        torch.nn.init.xavier_normal_(self.conv1.weight, gain=1.)
        print('xavier_normal_:', self.conv1.weight)


if __name__ == '__main__':
    demoNet = DemoNet()
```
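In practice the initializer is usually applied to every layer of a model rather than one layer at a time. A common pattern (a sketch, not taken from the original post) is to pass an init function to Module.apply(); the gain can be set with torch.nn.init.calculate_gain according to the activation function.

```python
import torch

def init_weights(m):
    # Apply Xavier initialization to every Conv2d / Linear weight in the model.
    if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear)):
        torch.nn.init.xavier_uniform_(m.weight, gain=torch.nn.init.calculate_gain('tanh'))
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)

model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3), torch.nn.Tanh(),
    torch.nn.Flatten(),
    # Linear input size assumes 1x28x28 inputs (8x26x26 after the 3x3 conv).
    torch.nn.Linear(8 * 26 * 26, 10),
)
model.apply(init_weights)   # recursively applies init_weights to every submodule
```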
VII. Comparative experiments
The experiments use tanh as the activation function.
1. Histogram of activation values in each layer
The upper figure shows the original initialization and the lower figure Xavier initialization. With Xavier initialization the activation values of each layer are fairly consistent across layers, and the values are smaller than with the original standard initialization. A rough simulation sketch of this comparison is given below.
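This is an illustrative sketch only: it uses U(−1/√n, 1/√n) as the "standard" scheme and an arbitrary depth and width, which need not match the original experimental setup, and it records the per-layer activation standard deviation instead of full histograms.

```python
import torch

def layer_activation_stds(init_fn, depth=10, width=256, batch=1024):
    # Push random data through a deep tanh MLP (no biases) and record
    # the standard deviation of the activations after every layer.
    torch.manual_seed(0)
    x = torch.randn(batch, width)
    stds = []
    for _ in range(depth):
        w = torch.empty(width, width)
        init_fn(w)
        x = torch.tanh(x @ w.t())
        stds.append(round(x.std().item(), 3))
    return stds

standard = lambda w: torch.nn.init.uniform_(w, -1 / w.size(1) ** 0.5, 1 / w.size(1) ** 0.5)
xavier = lambda w: torch.nn.init.xavier_uniform_(w)

print(layer_activation_stds(standard))  # std shrinks from layer to layer
print(layer_activation_stds(xavier))    # std stabilizes instead of collapsing toward 0
```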
2. Distribution of back-propagated gradients (gradients with respect to the states) in each layer
The upper figure shows the original initialization and the lower figure Xavier initialization. With Xavier initialization the back-propagated gradients of each layer are fairly consistent, and the values are smaller than with the original standard initialization. The author suspects that gradients of very different magnitudes across layers may lead to ill-conditioning or slow training.
3. Distribution of parameter gradients in each layer
Formula (3) has already shown that the variance of the parameter gradients is essentially independent of the layer index. The upper figure shows the original initialization and the lower figure Xavier initialization. We find that the parameter gradients under standard initialization are about an order of magnitude smaller.
4. Distribution of the variance of weight gradients in each layer
The upper figure shows the original initialization and the lower figure Xavier initialization. With Xavier initialization the variance of the weight gradients is consistent across layers.
VIII. Summary
1. Gaussian form of Xavier initialization: W ~ N(0, 2 / (n_in + n_out)).
2. Uniform form of Xavier initialization: W ~ U(−√6 / √(n_in + n_out), √6 / √(n_in + n_out)).
3. Xavier initialization builds on the standard initialization method and additionally takes the parameter variance of each layer in both forward and back propagation into account.
4. Shortcomings of Xavier initialization: the derivation rests on several assumptions. One is that the activation function is approximately linear around 0, which does not hold for ReLU. Another is that the activation values are symmetric about 0, which does not hold for sigmoid or ReLU. With sigmoid or ReLU, standard initialization and Xavier initialization show the same activation and parameter-gradient behaviour: the variance of the activation values decreases layer by layer, and so does the variance of the parameter gradients. A small illustration with ReLU follows.
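The ReLU shortcoming is easy to see numerically. The sketch below (arbitrary width and depth, illustrative only) pushes data through a ReLU network initialized with xavier_uniform_; the activation variance decays with depth instead of staying constant.

```python
import torch

# With ReLU, Xavier initialization no longer preserves the activation variance:
# roughly half of the units are zeroed out, so the variance decays with depth.
torch.manual_seed(0)
x = torch.randn(1024, 256)
for layer in range(10):
    w = torch.empty(256, 256)
    torch.nn.init.xavier_uniform_(w)
    x = torch.relu(x @ w.t())
    print(layer, round(x.var().item(), 4))   # variance shrinks layer by layer
```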