AlexNet experiment encounters: loss NaN, train acc 0.100, test acc 0.100
2022-07-07 00:36:00 【ddrrnnpp】
Scenario: dataset: the official Fashion-MNIST; network: AlexNet + PyTorch + ReLU activation function
Source code: https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
Knowledge points: gradient explosion, gradient vanishing
References (to keep up with the experts):
https://zh-v2.d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
Experimental phenomena:
Phenomenon one:
1. As soon as the code starts running, loss is NaN and train/test accuracy sit at 0.100.
Phenomenon two:
2. After trying a lower learning rate, loss still becomes NaN partway through training.
Phenomenon three:
Friend A: sometimes it runs with no problem; the network has barely changed, yet sometimes loss NaN appears and sometimes it doesn't.
Friend B: possible cause: the values produced by random initialization.
Friend A: tried a fix: changed the random seed, and only the epoch at which NaN appears shifted.
My shortest-path solution: add a BN layer (whichever cat catches the mice is a good cat, hhh):
https://zh-v2.d2l.ai/chapter_convolutional-modern/batch-norm.html
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to check whether we are in training or prediction mode
    if not torch.is_grad_enabled():
        # In prediction mode, use the mean and variance from the moving averages directly
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute the mean and variance along the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: compute the mean and variance per channel (axis=1).
            # keepdim=True so that broadcasting works later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch's mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving-average mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean.data, moving_var.data
class BatchNorm(nn.Module):
    # num_features: the number of outputs of a fully connected layer,
    # or the number of output channels of a convolutional layer
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters, updated by gradient descent; initialized to 1 and 0
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that are not model parameters, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is on a different device (e.g. GPU memory), copy
        # moving_mean and moving_var over to X's device
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
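For completeness, a minimal sketch of how the fix can be wired into the d2l AlexNet. The layer sizes follow the linked AlexNet chapter; I use PyTorch's built-in nn.BatchNorm2d here rather than the hand-written BatchNorm above (either works; the custom class takes (num_features, num_dims) arguments instead).

from torch import nn

# AlexNet for Fashion-MNIST (1 input channel, images resized to 224x224),
# with a batch-norm layer inserted after each convolution, before its ReLU
net = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
    nn.BatchNorm2d(96), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
    nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10))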

Other solutions found on CSDN:

Principle 1:
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
1. Gradient derivation + the chain rule
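The chain-rule product in question, in my own notation (assuming a d-layer network with hidden states h^{(l)} = f(W^{(l)} h^{(l-1)}), pre-activations z^{(l)}, and loss L):

\[
\frac{\partial \mathcal{L}}{\partial W^{(1)}}
  = \frac{\partial \mathcal{L}}{\partial h^{(d)}}
    \left(\prod_{l=d}^{2} \frac{\partial h^{(l)}}{\partial h^{(l-1)}}\right)
    \frac{\partial h^{(1)}}{\partial W^{(1)}},
\qquad
\frac{\partial h^{(l)}}{\partial h^{(l-1)}}
  = \operatorname{diag}\!\bigl(f'(z^{(l)})\bigr)\, W^{(l)}
\]

Whether the long product explodes or vanishes depends on whether its factors are typically larger or smaller than 1, which is exactly the split between sections 1.1 and 2.1 below.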


1.1 Derivative of the ReLU activation function + gradient explosion

1. The derivative of the ReLU activation function is either 1 or 0.
2. Gradient explosion: by the chain rule for derivatives, a product of gradients greater than 1 taken across many consecutive layers makes the gradient larger and larger, until it is far too large.
3. Gradient explosion makes the weights w of some layer too large, which destabilizes the network; in the extreme case, multiplying the data by a huge w overflows and yields NaN values.
1.2 The problem of gradient explosion:
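A minimal numeric sketch of the blow-up, in the spirit of the d2l numerical-stability chapter linked above (the matrix size and step count are arbitrary choices of mine):

import torch

torch.manual_seed(0)
M = torch.normal(0, 1, size=(4, 4))
for i in range(100):
    # Each step stands in for one more layer's Jacobian in the chain-rule
    # product; with entries of scale ~1 the product grows exponentially with depth
    M = M @ torch.normal(0, 1, size=(4, 4))
print(M)  # entries of astronomical magnitude; a deeper chain overflows to inf, and inf - inf gives nan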

2.1 Derivative of the sigmoid activation function + vanishing gradient
1. By the chain rule for derivatives, a product of gradients less than 1 taken across consecutive layers makes the gradient smaller and smaller, until the gradient in the first layer is effectively 0.
2.2 The problem of vanishing gradients:
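The mirror-image sketch for sigmoid (again a toy example of my own): its derivative sigmoid(x) * (1 - sigmoid(x)) is at most 0.25, so chaining 20 sigmoids shrinks the gradient geometrically:

import torch

x = torch.tensor(1.0, requires_grad=True)
y = x
for _ in range(20):
    y = torch.sigmoid(y)  # each sigmoid contributes a factor <= 0.25 to the gradient
y.backward()
print(x.grad)  # roughly 1e-13: the gradient has all but vanished by the first layer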

Analysis of the experimental phenomena:
1. The activation function is ReLU.
2. Even after adjusting the learning rate, the network still emits NaN partway through training.
------> Conclusion: gradient explosion
Principle 2:
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
1. AlexNet is a relatively deep network:

2. Batch normalization computes its statistics over a "mini-batch", which carries a certain randomness. To some extent, this mini-batch noise acts on the network as a regularizer, helping control model complexity.
3. With batch normalization in place, the learning rate lr can be set to a larger value, which accelerates convergence.
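As a usage sketch of that last point (assuming the d2l helper functions from the chapters linked above and the BN-augmented net defined earlier; the concrete lr and epoch count are my own picks, not tuned values):

from d2l import torch as d2l

# Fashion-MNIST upsampled to 224x224, as in the AlexNet chapter
batch_size = 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)

# With BN in the network, a larger lr than the chapter's 0.01 can still converge
lr, num_epochs = 0.1, 10
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())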
Many thanks to Mu Li for the explanation videos!!!! This article starts from a practical problem and works through the knowledge points the expert explains. It is original work; if there is any infringement, duplication, or error, please point it out!!!!!
