Deep learning parameter initialization (I) Xavier initialization with code
2022-07-03 07:04:00 【Xiaoshu Xiaoshu】
Contents

I. Brief introduction
II. Basic knowledge
III. Standard initialization method
IV. Xavier initialization assumptions
V. Simple derivation of Xavier initialization
VI. PyTorch implementation
VII. Comparative experiments
    1. Histogram of activation values in each layer
    2. Distribution of back-propagated (state) gradients in each layer
    3. Distribution of parameter gradients in each layer
    4. Distribution of the variance of weight gradients in each layer
VIII. Summary
I. Brief introduction
Xavier initialization is also called Glorot initialization, after its inventor Xavier Glorot. It is the initialization method proposed by Glorot et al. to address the problems of naive random initialization: the idea is to make the input and output of each layer follow roughly the same distribution, so that the activation values in later layers do not collapse toward 0.
Since weights are usually initialized from either a Gaussian or a uniform distribution, and the two behave almost identically as long as their variances match, the Gaussian and uniform cases are treated together below.
PyTorch already provides an implementation, introduced in detail below:
torch.nn.init.xavier_uniform_(tensor: Tensor, gain: float = 1.)
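As a minimal usage sketch (the 256-in / 128-out layer shape is an arbitrary choice for illustration):

import torch

# A hypothetical linear layer; the shape is chosen only for illustration.
linear = torch.nn.Linear(256, 128)

# Fill the weight tensor in place with Xavier/Glorot uniform values.
torch.nn.init.xavier_uniform_(linear.weight, gain=1.)

# All sampled values should lie within +/- sqrt(6 / (fan_in + fan_out)).
bound = (6.0 / (256 + 128)) ** 0.5
print(linear.weight.abs().max().item() <= bound)  # expected: True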
II. Basic knowledge
1. Variance of the uniform distribution: for $X \sim U(a, b)$,
$$\mathrm{Var}(X) = \frac{(b-a)^2}{12}.$$
2. Suppose that the random variables X and Y are independent of each other. Then
$$\mathrm{Var}(XY) = E(X)^2\,\mathrm{Var}(Y) + E(Y)^2\,\mathrm{Var}(X) + \mathrm{Var}(X)\,\mathrm{Var}(Y).$$
3. Suppose that the random variables X and Y are independent of each other and E(X) = E(Y) = 0. Then
$$\mathrm{Var}(XY) = \mathrm{Var}(X)\,\mathrm{Var}(Y).$$
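A quick numeric check of point 3 (a sketch; the scales and sample size are arbitrary):

import torch

torch.manual_seed(0)

# Two independent, zero-mean random variables with known variances.
x = torch.randn(1_000_000) * 2.0   # Var(X) = 4
y = torch.randn(1_000_000) * 3.0   # Var(Y) = 9

# For independent zero-mean X and Y, Var(XY) should be close to Var(X) * Var(Y) = 36.
print((x * y).var().item())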
III. Standard initialization method
When the weights are initialized from the uniform distribution
$$W_{i,j} \sim U\!\left(-\frac{1}{\sqrt{n}},\ \frac{1}{\sqrt{n}}\right),$$
formula (1) gives the variance
$$\mathrm{Var}(W) = \frac{(2/\sqrt{n})^2}{12} = \frac{1}{3n},$$
so the corresponding Gaussian form is
$$W_{i,j} \sim N\!\left(0,\ \frac{1}{3n}\right),$$
where $n$ is the number of inputs to the layer.
For a fully connected network, treat each dimension x of the input X as a random variable, and assume E(x) = 0 and Var(x) = 1. Assuming further that the weights W and the input X are independent of each other, the variance of a hidden-layer state s is
$$\mathrm{Var}(s) = n\,\mathrm{Var}(w)\,\mathrm{Var}(x) = n \cdot \frac{1}{3n} \cdot 1 = \frac{1}{3}.$$
The standard initialization method therefore has an appealing property: the hidden state has mean 0 and a constant variance of 1/3, independent of the number of layers. This means that for a function such as sigmoid the pre-activation falls in the region where the gradient is significant.
However, sigmoid activations are always greater than 0, so the input to the next layer no longer satisfies E(x) = 0. In practice, standard initialization is only suitable for activation functions that satisfy the Glorot assumptions below, such as tanh.
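A minimal numeric sketch of this constant-variance property (the width 512 and the sample count are arbitrary choices):

import torch

torch.manual_seed(0)

n = 512                      # arbitrary layer width, for illustration only
x = torch.randn(10000, n)    # inputs with E(x) = 0, Var(x) = 1

# Standard initialization: W ~ U(-1/sqrt(n), 1/sqrt(n)), so Var(W) = 1/(3n).
W = torch.empty(n, n).uniform_(-1.0 / n ** 0.5, 1.0 / n ** 0.5)

s = x @ W                    # hidden-layer pre-activation
print(s.var().item())        # expected to be close to 1/3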
IV. Xavier initialization assumptions
At the beginning of the article we gave the necessary conditions for parameter initialization. Those conditions only guarantee that something useful can be learned during training, namely that the parameter gradients are not 0 (because the pre-activations are kept in the effective region of the activation function). Glorot argues that a good initialization should additionally keep the variances of the activation values and of the state gradients consistent across layers during propagation. That is, the forward-propagated activations and the back-propagated state gradients should have the same variance in every layer:
$$\forall (i, i'):\quad \mathrm{Var}\!\left(z^{i}\right) = \mathrm{Var}\!\left(z^{i'}\right), \qquad \mathrm{Var}\!\left(\frac{\partial\,\mathrm{Cost}}{\partial s^{i}}\right) = \mathrm{Var}\!\left(\frac{\partial\,\mathrm{Cost}}{\partial s^{i'}}\right),$$
where $z^{i}$ denotes the activations and $s^{i}$ the pre-activation states of layer $i$.
We call these two conditions the Glorot conditions.
Combining these, we now make the following assumptions:
1. Every input feature has the same variance: Var(x);
2. The activation function is symmetric (odd), so the input mean of every layer can be assumed to be 0;
3. f′(0) = 1;
4. At initialization, the state values fall in the linear region of the activation function: $f'(s_i^{(k)}) \approx 1$.
The last three are assumptions about the activation function, which we call the Glorot activation-function assumptions; tanh satisfies them, as the quick check below illustrates.
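As a sanity-check sketch (not part of the original derivation), the assumptions can be verified numerically for tanh:

import torch

x = torch.tensor(0.0, requires_grad=True)
torch.tanh(x).backward()

print(torch.tanh(torch.tensor(0.0)).item())   # 0.0: the output is centered at 0
print(x.grad.item())                          # 1.0: f'(0) = 1
print(torch.allclose(torch.tanh(torch.tensor(-2.0)),
                     -torch.tanh(torch.tensor(2.0))))  # True: tanh is an odd (symmetric) function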
V. Simple derivation of Xavier initialization
First, write down the expressions for the state gradient and the parameter gradient (in the notation of the Glorot paper, where $z^{i}$ are the activations and $s^{i}$ the states of layer $i$):
$$\frac{\partial\,\mathrm{Cost}}{\partial s_k^{i}} = f'(s_k^{i})\, W_{k,\bullet}^{\,i+1}\, \frac{\partial\,\mathrm{Cost}}{\partial s^{i+1}}, \qquad \frac{\partial\,\mathrm{Cost}}{\partial w_{l,k}^{i}} = z_l^{i}\, \frac{\partial\,\mathrm{Cost}}{\partial s_k^{i}}.$$
Taking a fully connected layer as an example, a single pre-activation is
$$s = \sum_{j=1}^{n_i} w_j x_j + b,$$
where $n_i$ is the number of inputs.
From probability theory we have the variance formula (point 2 above)
$$\mathrm{Var}(w_j x_j) = E(w_j)^2\,\mathrm{Var}(x_j) + E(x_j)^2\,\mathrm{Var}(w_j) + \mathrm{Var}(w_j)\,\mathrm{Var}(x_j).$$
In particular, when the input and the weight are both assumed to have mean 0 (with BN this is now easier to satisfy), the formula simplifies to
$$\mathrm{Var}(w_j x_j) = \mathrm{Var}(w_j)\,\mathrm{Var}(x_j).$$
Assuming further that the inputs x and the weights w are independent and identically distributed, the output variance is
$$\mathrm{Var}(s) = n_i\,\mathrm{Var}(w)\,\mathrm{Var}(x),$$
so keeping the input and output variances equal requires
$$n_i\,\mathrm{Var}(w) = 1, \quad \text{i.e.} \quad \mathrm{Var}(w) = \frac{1}{n_i}.$$
For a multi-layer network, the variance of layer $i$ can be written as a product over the preceding layers, where $i$ is the current layer index:
$$\mathrm{Var}\!\left(z^{i}\right) = \mathrm{Var}(x) \prod_{i'=0}^{i-1} n_{i'}\,\mathrm{Var}\!\left(W^{i'}\right).$$
In particular, the back-propagated gradient has a similar form: for a network of depth $d$,
$$\mathrm{Var}\!\left(\frac{\partial\,\mathrm{Cost}}{\partial s^{i}}\right) = \mathrm{Var}\!\left(\frac{\partial\,\mathrm{Cost}}{\partial s^{d}}\right) \prod_{i'=i}^{d-1} n_{i'+1}\,\mathrm{Var}\!\left(W^{i'}\right).$$
To sum up, keeping the variance of every layer consistent in both forward and backward propagation requires
$$\forall i:\quad n_i\,\mathrm{Var}\!\left(W^{i}\right) = 1 \quad \text{and} \quad n_{i+1}\,\mathrm{Var}\!\left(W^{i}\right) = 1.$$
However, in practice the numbers of inputs and outputs of a layer are usually not equal, so as a compromise we average the two constraints, and the weight variance should finally satisfy
$$\mathrm{Var}\!\left(W^{i}\right) = \frac{2}{n_i + n_{i+1}}.$$
Therefore the Gaussian form of Xavier initialization is
$$W \sim N\!\left(0,\ \frac{2}{n_i + n_{i+1}}\right).$$
From the variance formula of the uniform distribution, for $U(-a, a)$ we have $\mathrm{Var} = \frac{(2a)^2}{12} = \frac{a^2}{3}$. Setting $\frac{a^2}{3} = \frac{2}{n_i + n_{i+1}}$ gives $a = \sqrt{\frac{6}{n_i + n_{i+1}}}$, so Xavier initialization is implemented as the uniform distribution
$$W \sim U\!\left(-\sqrt{\frac{6}{n_i + n_{i+1}}},\ \sqrt{\frac{6}{n_i + n_{i+1}}}\right).$$
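A minimal numeric check of this result (fan_in = 256 and fan_out = 128 are arbitrary illustration values): sampling from the uniform bound above should reproduce the target variance 2 / (fan_in + fan_out).

import torch

torch.manual_seed(0)

fan_in, fan_out = 256, 128   # arbitrary illustration values
target_var = 2.0 / (fan_in + fan_out)

# Xavier uniform bound: a = sqrt(6 / (fan_in + fan_out)), and Var(U(-a, a)) = a^2 / 3.
a = (6.0 / (fan_in + fan_out)) ** 0.5
W = torch.empty(fan_out, fan_in).uniform_(-a, a)

print(target_var, W.var().item())  # the two numbers should be close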
VI. PyTorch implementation
import torch


# A small demo network with a single convolution layer, used to show weight initialization.
class DemoNet(torch.nn.Module):
    def __init__(self):
        super(DemoNet, self).__init__()
        self.conv1 = torch.nn.Conv2d(1, 1, 3)
        print('random init:', self.conv1.weight)
        '''
        Xavier uniform initialization samples from U(-a, a) with
        a = gain * sqrt(6 / (fan_in + fan_out)).
        The gain is set according to the activation function; this method
        is also known as Glorot initialization.
        '''
        torch.nn.init.xavier_uniform_(self.conv1.weight, gain=1.)
        print('xavier_uniform_:', self.conv1.weight)
        '''
        Xavier normal initialization samples from a normal distribution with
        mean = 0 and std = gain * sqrt(2 / (fan_in + fan_out)).
        '''
        torch.nn.init.xavier_normal_(self.conv1.weight, gain=1.)
        print('xavier_normal_:', self.conv1.weight)


if __name__ == '__main__':
    demoNet = DemoNet()
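In practice, initialization is usually applied to every layer of a model at once. The helper below is a hypothetical sketch of that pattern (the layer shapes are arbitrary), not part of the original post:

import torch


def init_weights_xavier(module):
    # Hypothetical helper: apply Xavier uniform initialization to every Conv2d/Linear weight.
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        torch.nn.init.xavier_uniform_(module.weight, gain=1.)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)


model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3),
    torch.nn.Flatten(),
    torch.nn.Linear(8 * 26 * 26, 10),
)
model.apply(init_weights_xavier)  # .apply() visits every submodule recursively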
VII. Comparative experiments
The experiments use tanh as the activation function.
1. Histogram of activation values in each layer
[Figure: per-layer histograms of activation values; standard initialization (top) vs Xavier initialization (bottom)]
The top panel shows standard initialization and the bottom panel shows Xavier initialization. With Xavier initialization the activation values of the different layers are much more consistent, and their magnitudes are smaller than under the original standard initialization.
2. Distribution of back-propagated gradients (gradients with respect to the states) in each layer
[Figure: per-layer histograms of back-propagated state gradients; standard initialization (top) vs Xavier initialization (bottom)]
The top panel shows standard initialization and the bottom panel shows Xavier initialization. With Xavier initialization the back-propagated gradients of the different layers are much more consistent, and their magnitudes are smaller than under standard initialization. The author suspects that gradients of very different scales across layers may lead to ill-conditioning or slow training.
3. Distribution of parameter gradients in each layer
[Figure: per-layer histograms of parameter (weight) gradients; standard initialization (top) vs Xavier initialization (bottom)]
Formula (3) already shows that the variance of the parameter gradient is essentially independent of the layer index. The top panel shows standard initialization and the bottom panel shows Xavier initialization. Compared with Xavier initialization, the parameter gradients under standard initialization are about one order of magnitude smaller.
4. Distribution of the variance of weight gradients in each layer
[Figure: per-layer variance of weight gradients; standard initialization (top) vs Xavier initialization (bottom)]
The top panel shows standard initialization and the bottom panel shows Xavier initialization. With Xavier initialization the variance of the weight gradients is consistent across layers.
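A minimal sketch along these lines (not the paper's exact setup; the depth, width, and synthetic input are arbitrary choices) that prints the per-layer activation standard deviation of a tanh MLP under the two initializations:

import torch


def layer_activation_stds(init_fn, depth=5, width=256, n_samples=2048):
    # Forward random data through a tanh MLP and record each layer's activation std.
    torch.manual_seed(0)
    x = torch.randn(n_samples, width)
    stds = []
    for _ in range(depth):
        layer = torch.nn.Linear(width, width)
        init_fn(layer.weight)
        torch.nn.init.zeros_(layer.bias)
        x = torch.tanh(layer(x))
        stds.append(round(x.std().item(), 4))
    return stds


# Standard initialization: U(-1/sqrt(n), 1/sqrt(n)); Xavier: the uniform formula above.
standard = lambda w: torch.nn.init.uniform_(w, -1.0 / w.shape[1] ** 0.5, 1.0 / w.shape[1] ** 0.5)
xavier = lambda w: torch.nn.init.xavier_uniform_(w, gain=1.)

print('standard:', layer_activation_stds(standard))
print('xavier:  ', layer_activation_stds(xavier))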
VIII. Summary
1. Gaussian form of Xavier initialization:
$$W \sim N\!\left(0,\ \frac{2}{n_i + n_{i+1}}\right)$$
2. Uniform form of Xavier initialization:
$$W \sim U\!\left(-\sqrt{\frac{6}{n_i + n_{i+1}}},\ \sqrt{\frac{6}{n_i + n_{i+1}}}\right)$$
3. Xavier initialization builds on the standard initialization method by taking the per-layer variance of both forward propagation and back propagation into account.
4. Shortcomings of Xavier initialization: the derivation relies on several assumptions. One is that the activation function behaves linearly (around 0), which does not hold for ReLU. Another is that the activation values are symmetric about 0, which does not hold for sigmoid or ReLU. When sigmoid or ReLU is used, the activations and parameter gradients obtained with Xavier initialization behave much like those of standard initialization: the variance of the activations decreases layer by layer, and so does the variance of the parameter gradients.
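A short sketch (arbitrary depth and width, not from the original experiments) that illustrates the last point: with ReLU, the activation variance shrinks layer by layer even under Xavier initialization.

import torch

torch.manual_seed(0)

width, depth = 256, 10       # arbitrary illustration choices
x = torch.randn(2048, width)

for i in range(depth):
    layer = torch.nn.Linear(width, width)
    torch.nn.init.xavier_uniform_(layer.weight, gain=1.)
    torch.nn.init.zeros_(layer.bias)
    x = torch.relu(layer(x))
    print(i, x.var().item())  # the variance decreases layer by layer with ReLU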