AlexNet experiment encounters: loss NaN, train acc 0.100, test acc 0.100
2022-07-07 00:36:00 【ddrrnnpp】
Scenario: Dataset: the official Fashion-MNIST
+ Network: AlexNet
+ PyTorch
+ ReLU activation function
Source code: https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
Knowledge points: gradient explosion, gradient vanishing
Reference material (keeping up with the experts):
https://zh-v2.d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
Experimental phenomena:
Phenomenon 1:
1. As soon as the code starts running, the problem in the title appears: loss NaN, train acc 0.100, test acc 0.100.
Phenomenon 2:
2. After I try to reduce the learning rate, loss still becomes NaN partway through training.
Phenomenon 3:
Friend A: Sometimes training runs with no problem at all; the network has barely changed, yet loss NaN sometimes appears and sometimes does not.
Friend B: Possible cause: the influence of the randomly initialized parameter values.
Friend A: Attempted fix: changing the random seed only changes the epoch at which the NaN appears (see the initialization sketch below).
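Since Friend B suspected the random initialization, one related mitigation is explicit weight initialization, as covered in the d2l numerical-stability-and-init chapter linked above. The sketch below is my own addition (not what this post ultimately used): it applies Xavier initialization and fixes the seed; `net` stands for the AlexNet nn.Sequential from the d2l source code.

import torch
from torch import nn

# Fix the seed so that whether NaN appears no longer depends on an unlucky draw,
# and initialize conv/linear weights with Xavier uniform instead of the defaults.
torch.manual_seed(42)

def init_weights(m):
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

# net.apply(init_weights)  # call this after building the AlexNet `net`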
My shortest-path solution: add BatchNorm (BN) layers.
(Any cat that catches mice is a good cat, hhh)
https://zh-v2.d2l.ai/chapter_convolutional-modern/batch-norm.html
import torch
from torch import nn


def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to determine whether we are in training or prediction mode
    if not torch.is_grad_enabled():
        # In prediction mode, directly use the moving-average mean and variance
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute mean and variance over the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolution: compute mean and variance per channel (axis=1).
            # Keep X's shape so that broadcasting works later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data


class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer, or number of
    # output channels of a convolution layer.
    # num_dims: 2 for a fully connected layer, 4 for a convolution layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters (updated by gradient descent), initialized to 1 and 0
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that are not model parameters, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not on the same device, copy moving_mean and moving_var
        # to the device where X lives
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
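Beyond the hand-written BatchNorm class above, PyTorch's built-in nn.BatchNorm2d / nn.BatchNorm1d do the same job. The sketch below is my wiring (following the d2l AlexNet for Fashion-MNIST resized to 224x224) and shows where the BN layers can be inserted, one before each ReLU:

import torch
from torch import nn

# d2l-style AlexNet with a BatchNorm layer before every ReLU.
# The hand-written BatchNorm class could be used instead, e.g. BatchNorm(96, num_dims=4).
net = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
    nn.BatchNorm2d(96), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(6400, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10))

# Quick shape check with a dummy batch of two 1x224x224 images
X = torch.randn(2, 1, 224, 224)
print(net(X).shape)  # expected: torch.Size([2, 10])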
Other solutions on CSDN:
Principle 1:
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
1. Gradient derivation + the chain rule
1.1. Derivative property of the ReLU activation function + gradient explosion
1. The derivative of the ReLU activation function is either 1 or 0.
2. Gradient explosion: by the chain rule, taking the product of gradients greater than 1 across many consecutive layers makes the overall gradient larger and larger, until it is far too large.
3. Gradient explosion makes the weights w of some layer too large, which destabilizes the network; in the extreme case, multiplying the data by a huge w overflows and yields NaN values. (A small numerical illustration follows after 1.2.)
1.2. The gradient explosion problem:
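A small numerical illustration (my addition, in the spirit of the d2l numerical-stability chapter linked above): the chain rule multiplies one Jacobian-like factor per layer, and when those factors are on average larger than 1 the product explodes.

import torch

torch.manual_seed(0)
M = torch.normal(0, 1, size=(4, 4))
print('a single matrix:\n', M)
for i in range(100):
    # Each extra factor plays the role of one more layer in the chain rule
    M = M @ torch.normal(0, 1, size=(4, 4))
print('after multiplying 100 matrices:\n', M)  # huge entries (or inf), i.e. exploded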
2.1. Derivative property of the sigmoid activation function + vanishing gradient
1. By the chain rule, taking the product of gradients smaller than 1 across many consecutive layers makes the overall gradient smaller and smaller, until the gradient in the first layers is essentially 0.
2.2. The vanishing gradient problem:
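The corresponding illustration for sigmoid (again my addition): its derivative is at most 0.25, so a chain of many such factors shrinks toward 0.

import torch

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.sigmoid(x)
y.backward(torch.ones_like(x))
print(x.grad.max())   # about 0.25, the largest possible sigmoid derivative
print(0.25 ** 20)     # ~9.1e-13: an upper bound on the product after 20 layers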
Analysis of the experimental phenomena:
1. ReLU activation function (its derivative is 1 or 0, so it does not shrink the gradients).
2. Even after adjusting the learning rate, the network still outputs NaN partway through training.
------> Conclusion: gradient explosion.
Principle 2:
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
1. AlexNet is a relatively deep network.
2. Batch normalization draws "mini-batches", which carry a certain randomness. To some extent, these mini-batches inject some noise into the network, and that noise helps control model complexity.
3. After batch normalization, the learning rate lr can be set to a larger value, which accelerates convergence. (A training sketch with a larger lr follows below.)
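A minimal training sketch (my addition, assuming the d2l utilities and the BN-augmented `net` defined earlier) with a larger learning rate than the plain d2l AlexNet uses:

import torch
from d2l import torch as d2l

# `net` is assumed to be the nn.Sequential with BatchNorm layers from the earlier sketch.
lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())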
Many thanks to Mu Li's explanation videos!!!! This post starts from a practical problem and works through the knowledge points he explains; it is my own write-up. If there is any infringement, duplication, or error, please point it out!!!