AlexNet experiment encounters: loss NaN, train acc 0.100, test acc 0.100
2022-07-07 00:36:00 【ddrrnnpp】
Setup: dataset: the official Fashion-MNIST
+ network: AlexNet
+ PyTorch
+ ReLU activation function
Source code: https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
Knowledge points: gradient explosion, vanishing gradient
References (following the experts):
https://zh-v2.d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
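For reference, a minimal sketch of the baseline run (reconstructed from the linked d2l AlexNet chapter, assuming the d2l package's load_data_fashion_mnist and train_ch6 helpers; the hyperparameters are the chapter's defaults, not necessarily the exact values I used):

import torch
from torch import nn
from d2l import torch as d2l

# AlexNet as defined in the linked d2l chapter (input resized to 224x224, 10 classes).
net = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
    nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10))

batch_size, lr, num_epochs = 128, 0.01, 10
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())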
Experimental phenomena:
Phenomenon one:
1. As soon as the code starts running, loss is NaN and both train acc and test acc sit at 0.100.
Phenomenon two:
2. After I try to reduce the learning rate, loss still turns into NaN partway through training.
Phenomenon three:
Friend A: Sometimes the run is fine; the network has barely changed, yet sometimes loss goes NaN and sometimes it doesn't.
Friend B: Possible cause: the influence of the randomly initialized parameter values.
Friend A: Tried a fix: changing the random seed only changes the epoch at which the NaN appears.
My shortest-path solution: add a BN (batch normalization) layer.
(Whatever catches mice is a good cat, hhh.)
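Aside (not from the original post): a small per-batch check can pinpoint exactly when the loss or a gradient first stops being finite, which helps confirm phenomena like the ones above; assert_finite is a hypothetical helper name.

import torch

def assert_finite(loss, model):
    # Raise as soon as the loss or any gradient stops being finite.
    if not torch.isfinite(loss):
        raise RuntimeError(f'loss became {loss.item()}')
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f'non-finite gradient in {name}')

# Usage inside the training loop, right after l.backward() and before optimizer.step():
#   l = loss(net(X), y)
#   l.backward()
#   assert_finite(l, net)
#   optimizer.step()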
https://zh-v2.d2l.ai/chapter_convolutional-modern/batch-norm.html
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to determine whether we are in training or prediction mode
    if not torch.is_grad_enabled():
        # In prediction mode, directly use the moving-average mean and variance
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute mean and variance over the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2-D convolution: compute mean and variance per channel (axis=1).
            # Keep the dimensions of X so that broadcasting works later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving-average mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer,
    #               or number of output channels of a convolutional layer
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters (updated by gradient descent), initialized to 1 and 0
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that are not model parameters, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If moving_mean and moving_var are not on the same device as X,
        # copy them to the device of X
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
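A quick sanity check of the BatchNorm module above, plus one way it could be placed in AlexNet's first convolutional block. The original post does not record the exact placement used, so treat this as a sketch; PyTorch's built-in nn.BatchNorm2d does the same job:

X = torch.randn(4, 96, 26, 26)
bn = BatchNorm(96, num_dims=4)
print(bn(X).shape)  # torch.Size([4, 96, 26, 26])

# Example placement: normalize the conv output before the ReLU.
conv_block = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
    BatchNorm(96, num_dims=4),   # or nn.BatchNorm2d(96)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2))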
Other solutions on CSDN:
Principle 1:
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
1. Gradient derivation + the chain rule
1.1. Derivative of the ReLU activation + gradient explosion
1. The derivative of the ReLU activation is either 1 or 0.
2. Gradient explosion: because of the chain rule, multiplying per-layer gradients greater than 1 across many consecutive layers makes the gradient larger and larger, until it is far too large.
3. An exploding gradient makes the weights w of some layer too large, destabilizing the network; in the extreme case, multiplying the data by a huge w overflows and yields NaN values.
1.2. The problem caused by gradient explosion: the weights blow up and the loss turns into NaN.
2.1. Derivative of the sigmoid activation + vanishing gradient
1. Because of the chain rule, multiplying per-layer gradients less than 1 across many consecutive layers makes the gradient smaller and smaller, until the gradient in the earliest layers is effectively 0.
2.2. The problem caused by vanishing gradients: the earliest layers barely get updated.
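To make the chain-rule argument concrete, here is a small numerical sketch (in the spirit of the d2l numerical-stability chapter linked above, not code from this post): multiplying many random per-layer Jacobians either blows up toward inf or collapses toward 0.

import torch

torch.manual_seed(0)

# Product of many per-layer Jacobians, as happens during backprop through a deep net.
M = torch.normal(0, 1, size=(4, 4))
prod_large = M.clone()
prod_small = M.clone()
for _ in range(100):
    prod_large = prod_large @ torch.normal(0, 1, size=(4, 4))          # factors often > 1
    prod_small = prod_small @ (0.1 * torch.normal(0, 1, size=(4, 4)))  # factors < 1

print(prod_large.abs().max())  # astronomically large, possibly inf: exploding gradient
print(prod_small.abs().max())  # essentially 0 (underflows): vanishing gradient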
Analysis of the experimental phenomena:
1. The network uses the ReLU activation function, whose derivative is 1 or 0, so it does not shrink gradients the way sigmoid does.
2. Adjusting (lowering) the learning rate only delays the problem: the network still outputs NaN partway through training.
------> Conclusion: gradient explosion.
Principle 2:
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
1. AlexNet is a relatively deep network.
2. Batch normalization works on "mini-batches", which carry a certain randomness. To some extent this mini-batch noise acts on the network as a form of regularization, helping to control model complexity.
3. After batch normalization, the learning rate lr can be set to a larger value, which accelerates convergence.
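A hedged sketch tying the three points together (assuming the same d2l helpers as in the baseline sketch; the BN placement and the learning rate here are illustrative choices, not the exact ones from my run): the same AlexNet with nn.BatchNorm2d / nn.BatchNorm1d inserted before each ReLU, trained with a larger learning rate.

import torch
from torch import nn
from d2l import torch as d2l

# AlexNet with batch normalization before every ReLU (one reasonable placement).
net_bn = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.BatchNorm2d(96), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
    nn.Linear(6400, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10))

batch_size, num_epochs, lr = 128, 10, 0.1   # lr = 0.1 is a hypothetical "larger" value
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net_bn, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())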
Many thanks to Li Mu for the explanation videos!!! This post starts from a practical problem and works through the knowledge points he explains. If anything here infringes, duplicates other work, or is simply wrong, please point it out!