Problems encountered in an AlexNet experiment: loss NaN, train acc 0.100, test acc 0.100

2022-07-07 00:36:00 ddrrnnpp

Setup: dataset: the official Fashion-MNIST; network: AlexNet + PyTorch + ReLU activation (a sketch of the network follows the reference list below)
Source code: https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
Knowledge points: gradient explosion, gradient vanishing
Reference material (following the experts):

https://zh-v2.d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
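
For context, here is a minimal sketch of the AlexNet variant from the d2l.ai chapter linked above (single-channel 224x224 Fashion-MNIST input, 10 classes); the authoritative code is at the link, so treat this as an approximation:

import torch
from torch import nn

# Sketch of the d2l-style AlexNet for Fashion-MNIST (1 input channel, 10 classes).
# Images are assumed to be resized to 224x224 as in the d2l chapter.
net = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10))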

Experimental phenomena:

Phenomenon one

1. As soon as the code starts running, the following appears:
[screenshot: loss is nan; train acc and test acc stay at 0.100 from the first epoch]

Phenomenon two

2. After I try reducing the learning rate, loss becomes nan partway through training:
[screenshot: loss turns to nan in the middle of training]

Phenomenon three

Friend A: sometimes the run is fine; the network has barely changed, yet sometimes loss nan appears and sometimes it does not.
Friend B: possible cause: the influence of the randomly initialized parameter values.
Friend A: attempted fix: after changing the random seed, only the epoch at which the nan appears changes (see the sketch below for fixing the seed to reproduce this).
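To check Friend B's hypothesis, one option is to fix every source of randomness so that the run, and the epoch where the nan shows up, becomes reproducible. A minimal sketch (my own helper, not from the original post), assuming CUDA is used:

import random
import numpy as np
import torch

def set_seed(seed=42):
    # Fix all sources of randomness so the run (and the epoch where the
    # nan appears) is reproducible across restarts.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # change the seed and rerun to see whether the nan epoch moves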

My shortest-path solution: add BN (batch normalization) layers (a cat that catches mice is a good cat, hhh). A sketch of the modified network follows the BatchNorm code below.

https://zh-v2.d2l.ai/chapter_convolutional-modern/batch-norm.html
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to tell whether we are in training or prediction mode
    if not torch.is_grad_enabled():
        # In prediction mode, use the moving-average mean and variance passed in
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute mean and variance over the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2D convolutional layer: compute mean and variance over the channel
            # dimension (axis=1). Keep X's number of dimensions so that
            # broadcasting works later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages of the mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer, or number of
    # output channels of a convolutional layer
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters, updated by gradient descent; initialized to 1 and 0
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Buffers that are not model parameters, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If X is not on the same device (e.g. it lives in GPU memory), move
        # moving_mean and moving_var to the device where X is
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y

[screenshot: training log after adding the BN layers]
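The post does not show the modified network, so here is a minimal sketch of how BN layers might be inserted into the AlexNet above. It uses the built-in nn.BatchNorm2d / nn.BatchNorm1d; the hand-written BatchNorm class would slot in the same way (num_dims=4 after conv layers, num_dims=2 after linear layers):

import torch
from torch import nn

# Sketch only: AlexNet with a BN layer after each conv/linear layer, before the ReLU.
net_bn = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
    nn.BatchNorm2d(96), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(6400, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10))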

Other solutions found on CSDN:

[screenshot: other solutions collected from CSDN posts]

Principle 1

https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
1. Gradient derivation + the chain rule

[slides: deriving the gradients layer by layer via the chain rule]
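Written out (my notation, not taken from the slides): for a network with hidden layers $\mathbf{h}^{(t)} = f_t(\mathbf{h}^{(t-1)})$, $t = 1, \dots, L$, and loss $\ell$, the gradient with respect to the first layer's weights is a product of per-layer Jacobians:

$$
\frac{\partial \ell}{\partial \mathbf{W}^{(1)}}
= \frac{\partial \ell}{\partial \mathbf{h}^{(L)}}
  \cdot \frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(L-1)}}
  \cdots
  \frac{\partial \mathbf{h}^{(2)}}{\partial \mathbf{h}^{(1)}}
  \cdot \frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{W}^{(1)}}
$$

With $L-1$ Jacobian factors in the product, factors consistently larger than 1 make the gradient grow exponentially with depth (explosion), and factors consistently smaller than 1 make it shrink exponentially (vanishing).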

1.1 The derivative of the ReLU activation function + gradient explosion

[slide: the ReLU function and its derivative]
1. The derivative of the ReLU activation is either 1 or 0.
2. Gradient explosion: by the chain rule for derivatives, taking the product of gradients greater than 1 over many consecutive layers makes the gradient larger and larger, until it becomes far too large (see the sketch below).
3. Gradient explosion makes the weights w of some layer too large, which destabilizes the network; in extreme cases, multiplying the data by such a large w overflows and yields NaN values.
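
A minimal sketch of this blow-up, along the lines of the d2l numerical-stability chapter linked above: repeatedly multiplying random matrices (standing in for per-layer Jacobians) makes the entries grow without bound.

import torch

torch.manual_seed(0)
prod = torch.normal(0, 1, size=(4, 4))
# Multiply 100 "layer Jacobians" together; the entries typically grow to
# astronomically large magnitudes (and can overflow to inf).
for _ in range(100):
    prod = prod @ torch.normal(0, 1, size=(4, 4))
print(prod)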

1.2 Problems caused by gradient explosion:

[slide: problems caused by gradient explosion]

2.1 The derivative of the sigmoid activation function + vanishing gradients

[slide: the sigmoid function and its derivative]

1. By the chain rule for derivatives, taking the product of gradients smaller than 1 over many consecutive layers makes the gradient smaller and smaller, until the gradient at the earliest layers is effectively 0 (see the sketch below).
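
A minimal sketch of why sigmoid behaves this way, following the d2l numerical-stability chapter: the sigmoid's gradient is at most 0.25 and is nearly zero for large |x|, so stacking many sigmoid layers multiplies many small factors together.

import torch

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.sigmoid(x)
y.backward(torch.ones_like(x))
# The gradient peaks at about 0.25 (at x = 0) and is nearly 0 once |x| > 4,
# so a product of many such factors shrinks toward 0.
print(x.grad.max())           # ~0.25
print(x.grad[0], x.grad[-1])  # ~0 at both ends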

2.2 Problems caused by vanishing gradients:

[slide: problems caused by vanishing gradients]
Analysis of the experimental phenomena:
1. The ReLU activation function is used (its derivative is 1 or 0, so the gradients are not being shrunk).
2. Lowering the learning rate still produces nan, just partway through training.
------>
Conclusion: gradient explosion (a sketch for checking this follows below).
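
One way to confirm (or mitigate) a gradient explosion without adding BN is to log and clip the global gradient norm during training. This is a minimal sketch of my own, not what the original post did; the names (training_step, loss_fn, max_norm) are illustrative:

import torch

def training_step(net, loss_fn, optimizer, X, y, max_norm=10.0):
    # One training step that logs the global gradient norm and clips it.
    # A norm that keeps growing or turns inf/nan is the signature of gradient explosion.
    optimizer.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    total_norm = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm)
    optimizer.step()
    return loss.item(), total_norm.item()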

Principle 2

https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237

1. AlexNet is a relatively deep network:

[slide: the AlexNet architecture]

2. Batch normalization works on mini-batches, which are drawn with a certain randomness. To some extent, this mini-batch noise acts on the network like a form of regularization and helps control model complexity.

3. After adding batch normalization, the learning rate lr can be set to a larger value, which accelerates convergence (a rough illustration follows below).
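
As a rough, hypothetical illustration (the learning rates are my own choices, not numbers from the post), using the net and net_bn sketches from earlier:

import torch

# Hypothetical values for illustration: without BN, a small lr such as 0.01 may be
# needed to keep training stable; with BN, a larger lr such as 0.1 usually still converges.
optimizer_plain = torch.optim.SGD(net.parameters(), lr=0.01)
optimizer_bn = torch.optim.SGD(net_bn.parameters(), lr=0.1)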

Many thanks to Li Mu's explanation videos!!!! This post starts from a practical problem and works through the knowledge points the expert explained. It is my own write-up; if there is any infringement, duplication, or error, please point it out!!!!


Original post

Copyright notice
This article was written by [ddrrnnpp]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/188/202207061649544318.html