AlexNet experiment: loss NaN, train acc 0.100, test acc 0.100
2022-07-07 00:36:00 【ddrrnnpp】
Scenario: dataset: the official FashionMNIST
+ network: AlexNet
+ PyTorch
+ ReLU activation function
Source code: https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
Knowledge points: gradient explosion, gradient vanishing
Reference material (following the experts):
https://zh-v2.d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
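For context, the setup behind these numbers is, to my understanding, the recipe from the linked d2l chapter; the sketch below reproduces it, with the hyperparameters (batch size 128, lr 0.01, 10 epochs, images resized to 224x224) taken from that chapter:

    import torch
    from torch import nn
    from d2l import torch as d2l

    # AlexNet as defined in the linked d2l chapter: single input channel,
    # ReLU activations throughout
    net = nn.Sequential(
        nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
        nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(4096, 10))

    batch_size, lr, num_epochs = 128, 0.01, 10
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
    d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())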
Experimental phenomena:
Phenomenon one:
1. As soon as the code starts running, the loss is NaN and train/test accuracy sit at 0.100
Phenomenon two:
2. After I try to reduce the learning rate, the loss becomes NaN partway through training
Phenomenon three:
Friend A: sometimes it runs with no problem; the network has barely changed, yet loss NaN sometimes appears and sometimes does not
Friend B: possible cause: the influence of the randomly initialized variable values
Friend A: attempted fix: after changing the random seed, only the epoch at which the NaN appears changed
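Before reaching for a fix, it helps to pin down exactly when the first NaN shows up. This is my own sketch, not part of the original experiment, and check_finite is a made-up helper name; it can be called right after loss.backward():

    import torch

    def check_finite(loss, model, batch_idx):
        # Raise as soon as the loss or any gradient stops being finite,
        # so the offending batch is known exactly
        if not torch.isfinite(loss):
            raise RuntimeError(f"loss became {loss.item()} at batch {batch_idx}")
        for name, p in model.named_parameters():
            if p.grad is not None and not torch.isfinite(p.grad).all():
                raise RuntimeError(f"non-finite gradient in {name} at batch {batch_idx}")

    # usage inside the training loop, right after loss.backward():
    #     check_finite(loss, net, batch_idx)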
My shortest-path solution: add a BN layer
(any cat that catches mice is a good cat, hhh)
https://zh-v2.d2l.ai/chapter_convolutional-modern/batch-norm.html
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to tell training mode from prediction mode
    if not torch.is_grad_enabled():
        # In prediction mode, use the running mean and variance directly
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: mean and variance over the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolution: mean and variance over the channel dimension (axis=1).
            # Keep X's dimensions so that broadcasting works later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the running mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer,
    #               or number of output channels of a convolutional layer
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters, updated by gradient descent,
        # initialized to 1 and 0
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that are not model parameters, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not on the same device, copy moving_mean and moving_var
        # over to the device where X lives
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
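For completeness, here is how the layer above might be wired into the d2l AlexNet. This is my own sketch of the insertion (BN between every conv/linear layer and its ReLU), not code from the chapter, and it uses the BatchNorm class defined above:

    net = nn.Sequential(
        nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
        BatchNorm(96, num_dims=4), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, padding=2),
        BatchNorm(256, num_dims=4), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(256, 384, kernel_size=3, padding=1),
        BatchNorm(384, num_dims=4), nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, padding=1),
        BatchNorm(384, num_dims=4), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, padding=1),
        BatchNorm(256, num_dims=4), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
        nn.Linear(6400, 4096), BatchNorm(4096, num_dims=2), nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(4096, 4096), BatchNorm(4096, num_dims=2), nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(4096, 10))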
Other solutions on CSDN:
Principle 1:
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
1. Gradient derivation + the chain rule
1.1. The derivative of ReLU + gradient explosion
1. The derivative of the ReLU activation function is either 1 or 0.
2. Gradient explosion: because of the chain rule, taking the product of per-layer gradients that are greater than 1 across many consecutive layers makes the overall gradient larger and larger, until it is far too large.
3. Gradient explosion makes the weights w of some layer too large, destabilizing the network; in the extreme case, multiplying the data by a huge w overflows and yields NaN values.
1.2. The problem of gradient explosion:
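A quick numerical illustration in the spirit of the d2l numerical-stability chapter linked above: repeatedly multiplying random matrices, which is effectively what the chain rule does across layers, quickly blows up:

    import torch

    torch.manual_seed(0)
    M = torch.normal(0, 1, size=(4, 4))
    for _ in range(100):
        # each multiplication plays the role of one more layer in the chain rule
        M = M @ torch.normal(0, 1, size=(4, 4))
    print(M)  # entries are astronomically large (or have overflowed to inf/nan)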
2.1. The derivative of sigmoid + vanishing gradients
1. Because of the chain rule, taking the product of per-layer gradients that are less than 1 across many consecutive layers makes the gradient smaller and smaller, until the gradient at the first layers is effectively 0.
2.2. The problem of vanishing gradients:
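A matching illustration for the vanishing case: the derivative of sigmoid never exceeds 0.25, so chaining many such factors shrinks the gradient geometrically (a minimal sketch):

    import torch

    x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
    y = torch.sigmoid(x)
    y.backward(torch.ones_like(x))
    print(x.grad.max())  # ~0.25: the sigmoid derivative never exceeds 1/4
    print(0.25 ** 50)    # ~7.9e-31: the best case after chaining 50 such layers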
Analysis of the experimental phenomena:
1. The activation function is ReLU, whose derivative is 1 or 0, so it does nothing to damp gradients greater than 1.
2. Adjusting the learning rate only changes where mid-training the network outputs NaN.
------> Conclusion: gradient explosion
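Besides adding BN, the numerical-stability page listed above pairs this diagnosis with careful weight initialization; here is a hedged sketch applying Xavier initialization, assuming net is the AlexNet defined earlier:

    from torch import nn

    def init_weights(m):
        # Xavier-initialize every conv and linear layer, as suggested in the
        # d2l numerical-stability chapter linked above
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            nn.init.xavier_uniform_(m.weight)

    net.apply(init_weights)  # `net` is the AlexNet defined earlier in this post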
Principle 2:
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
1. AlexNet is a relatively deep network.
2. Batch normalization computes its statistics over a "mini-batch", which carries a certain randomness. To some extent, this mini-batch noise acts on the network as regularization and helps control model complexity.
3. With batch normalization in place, the learning rate lr can be set to a larger value, which accelerates convergence.
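In practice the hand-written layer can also be replaced with PyTorch's built-in batch-norm modules; a minimal sketch of the substitution (my own, not from the post):

    from torch import nn

    # nn.BatchNorm2d (convolutions) / nn.BatchNorm1d (fully connected layers)
    # play the same role as the custom BatchNorm(num_features, num_dims) above
    block = nn.Sequential(
        nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
        nn.BatchNorm2d(96),  # equivalent of BatchNorm(96, num_dims=4)
        nn.ReLU())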
Many thanks to Li Mu's explanation videos!!!! This article starts from a practical problem and works through the knowledge points he explains. It is my own write-up; if there is any infringement, duplication, or error, please point it out!!!!!