AlexNet experiment encounters: loss NaN, train acc 0.100, test acc 0.100
2022-07-07 00:36:00 【ddrrnnpp】
Setup: dataset: the official Fashion-MNIST
+ network: AlexNet
+ PyTorch
+ ReLU activation function
Source code: https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
Knowledge points: gradient explosion, vanishing gradient
References (following the experts):
https://zh-v2.d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
https://zh-v2.d2l.ai/chapter_convolutional-modern/alexnet.html
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
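For reference, a minimal sketch of the baseline run (reconstructed from the linked d2l AlexNet chapter, assuming the d2l package's load_data_fashion_mnist and train_ch6 helpers; the hyperparameters are the chapter's defaults, not necessarily the exact values I used):

import torch
from torch import nn
from d2l import torch as d2l

# AlexNet as defined in the linked d2l chapter (input resized to 224x224, 10 classes).
net = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
    nn.Linear(6400, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10))

batch_size, lr, num_epochs = 128, 0.01, 10
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())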
Experimental phenomena:
Phenomenon one:
1. As soon as the code starts running, loss is NaN and both train acc and test acc sit at 0.100.
Phenomenon two:
2. After I try to reduce the learning rate, loss still turns into NaN partway through training.
Phenomenon three:
Friend A: Sometimes the run is fine; the network has barely changed, yet sometimes loss goes NaN and sometimes it doesn't.
Friend B: Possible cause: the influence of the randomly initialized parameter values.
Friend A: Tried a fix: changing the random seed only changes the epoch at which the NaN appears.
My shortest-path solution: add a BN (batch normalization) layer.
(Whatever catches mice is a good cat, hhh.)
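Aside (not from the original post): a small per-batch check can pinpoint exactly when the loss or a gradient first stops being finite, which helps confirm phenomena like the ones above; assert_finite is a hypothetical helper name.

import torch

def assert_finite(loss, model):
    # Raise as soon as the loss or any gradient stops being finite.
    if not torch.isfinite(loss):
        raise RuntimeError(f'loss became {loss.item()}')
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f'non-finite gradient in {name}')

# Usage inside the training loop, right after l.backward() and before optimizer.step():
#   l = loss(net(X), y)
#   l.backward()
#   assert_finite(l, net)
#   optimizer.step()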
https://zh-v2.d2l.ai/chapter_convolutional-modern/batch-norm.html
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use is_grad_enabled to determine whether we are in training or prediction mode
    if not torch.is_grad_enabled():
        # In prediction mode, directly use the moving-average mean and variance
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: compute mean and variance over the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # 2-D convolution: compute mean and variance per channel (axis=1).
            # Keep the dimensions of X so that broadcasting works later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, standardize with the current batch mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving-average mean and variance
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    # num_features: number of outputs of a fully connected layer,
    #               or number of output channels of a convolutional layer
    # num_dims: 2 for a fully connected layer, 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Scale and shift parameters (updated by gradient descent), initialized to 1 and 0
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that are not model parameters, initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If moving_mean and moving_var are not on the same device as X,
        # copy them to the device of X
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
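A quick sanity check of the BatchNorm module above, plus one way it could be placed in AlexNet's first convolutional block. The original post does not record the exact placement used, so treat this as a sketch; PyTorch's built-in nn.BatchNorm2d does the same job:

X = torch.randn(4, 96, 26, 26)
bn = BatchNorm(96, num_dims=4)
print(bn(X).shape)  # torch.Size([4, 96, 26, 26])

# Example placement: normalize the conv output before the ReLU.
conv_block = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
    BatchNorm(96, num_dims=4),   # or nn.BatchNorm2d(96)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2))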
Other solutions on CSDN:
Principle 1:
https://www.bilibili.com/video/BV1u64y1i75a?p=2&vd_source=d49d528422c02c473340ce042b8c8237
1. Gradient derivation + the chain rule
1.1. Derivative of the ReLU activation + gradient explosion
1. The derivative of the ReLU activation is either 1 or 0.
2. Gradient explosion: because of the chain rule, multiplying per-layer gradients greater than 1 across many consecutive layers makes the gradient larger and larger, until it is far too large.
3. An exploding gradient makes the weights w of some layer too large, destabilizing the network; in the extreme case, multiplying the data by a huge w overflows and yields NaN values.
1.2. The problem caused by gradient explosion: the weights blow up and the loss turns into NaN.
2.1. Derivative of the sigmoid activation + vanishing gradient
1. Because of the chain rule, multiplying per-layer gradients less than 1 across many consecutive layers makes the gradient smaller and smaller, until the gradient in the earliest layers is effectively 0.
2.2. The problem caused by vanishing gradients: the earliest layers barely get updated.
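To make the chain-rule argument concrete, here is a small numerical sketch (in the spirit of the d2l numerical-stability chapter linked above, not code from this post): multiplying many random per-layer Jacobians either blows up toward inf or collapses toward 0.

import torch

torch.manual_seed(0)

# Product of many per-layer Jacobians, as happens during backprop through a deep net.
M = torch.normal(0, 1, size=(4, 4))
prod_large = M.clone()
prod_small = M.clone()
for _ in range(100):
    prod_large = prod_large @ torch.normal(0, 1, size=(4, 4))          # factors often > 1
    prod_small = prod_small @ (0.1 * torch.normal(0, 1, size=(4, 4)))  # factors < 1

print(prod_large.abs().max())  # astronomically large, possibly inf: exploding gradient
print(prod_small.abs().max())  # essentially 0 (underflows): vanishing gradient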
Analysis of the experimental phenomena:
1. The network uses the ReLU activation function, whose derivative is 1 or 0, so it does not shrink gradients the way sigmoid does.
2. Adjusting (lowering) the learning rate only delays the problem: the network still outputs NaN partway through training.
------> Conclusion: gradient explosion.
Principle 2:
https://www.bilibili.com/video/BV1X44y1r77r?spm_id_from=333.999.0.0&vd_source=d49d528422c02c473340ce042b8c8237
1. AlexNet is a relatively deep network.
2. Batch normalization works on "mini-batches", which carry a certain randomness. To some extent this mini-batch noise acts on the network as a form of regularization, helping to control model complexity.
3. After batch normalization, the learning rate lr can be set to a larger value, which accelerates convergence.
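A hedged sketch tying the three points together (assuming the same d2l helpers as in the baseline sketch; the BN placement and the learning rate here are illustrative choices, not the exact ones from my run): the same AlexNet with nn.BatchNorm2d / nn.BatchNorm1d inserted before each ReLU, trained with a larger learning rate.

import torch
from torch import nn
from d2l import torch as d2l

# AlexNet with batch normalization before every ReLU (one reasonable placement).
net_bn = nn.Sequential(
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.BatchNorm2d(96), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.BatchNorm2d(384), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
    nn.Linear(6400, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(4096, 10))

batch_size, num_epochs, lr = 128, 10, 0.1   # lr = 0.1 is a hypothetical "larger" value
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net_bn, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())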
Many thanks to Li Mu for the explanation videos!!! This post starts from a practical problem and works through the knowledge points he explains. If anything here infringes, duplicates other work, or is simply wrong, please point it out!