Dropout: random deactivation
2022-06-30 18:08:00 【*Yuanzai】
1. A brief introduction to Dropout
1.1 Why Dropout was proposed
In machine learning, if a model has too many parameters while the training samples are too few, the trained model easily overfits. Overfitting is frequently encountered when training neural networks, and it shows up as follows: the model has a small loss and high prediction accuracy on the training data, but a large loss and low prediction accuracy on the test data.
Overfitting is a common problem in many machine learning tasks. If a model overfits, it is almost unusable. To deal with overfitting, model ensembling is usually adopted, i.e., training multiple models and combining them. The training time then becomes a big problem: it is time-consuming not only to train multiple models but also to evaluate them.
In summary, training deep neural networks has two big drawbacks:
- It is easy to overfit
- It is time consuming
Dropout can effectively alleviate overfitting and, to a certain extent, achieves a regularization effect.
1.2 What is Dropout?
In 2012, Hinton proposed Dropout in the paper 《Improving neural networks by preventing co-adaptation of feature detectors》. When a complex feedforward neural network is trained on a small dataset, it easily overfits. To prevent overfitting, the performance of the network can be improved by preventing the co-adaptation of feature detectors.
Also in 2012, Alex Krizhevsky and Hinton used Dropout in the paper 《ImageNet Classification with Deep Convolutional Neural Networks》 to prevent overfitting. The AlexNet model introduced in that paper set off a wave of neural-network applications, won the 2012 ImageNet image-recognition competition, and made CNNs the core algorithm for image classification.
After that, several more papers about Dropout appeared: 《Dropout: A Simple Way to Prevent Neural Networks from Overfitting》, 《Improving Neural Networks with Dropout》 and 《Dropout as data augmentation》.
From the papers above we can feel how important Dropout is in deep learning. So, what exactly is Dropout?
Dropout can be used as a trick when training deep neural networks. In each training batch, ignoring half of the feature detectors (setting half of the hidden units to 0) noticeably reduces overfitting. It reduces the co-adaptation between feature detectors (hidden units), where co-adaptation means that some detectors only work when other specific detectors are present.
Put simply, Dropout means that during forward propagation each neuron's activation is made to stop working (set to 0) with a certain probability p. This makes the model generalize better, because it no longer relies too heavily on particular local features, as shown in Figure 1.
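Before going into the details, here is a minimal PyTorch sketch (my own illustration, not part of the original text) of what "letting an activation stop working with probability p" looks like with the built-in nn.Dropout layer:

import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability p=0.5
x = torch.ones(1, 10)      # a toy activation vector

drop.train()               # training mode: random zeroing plus 1/(1-p) rescaling
print(drop(x))             # roughly half the entries are 0, the rest are 2.0

drop.eval()                # eval mode: Dropout is a no-op
print(drop(x))             # all entries stay 1.0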
2. Dropout workflow and usage
2.1 The Dropout workflow in detail
Suppose we are training a neural network like the one shown in Figure 2.
The input is x and the output is y. The normal procedure is: first propagate x forward through the network, then back-propagate the error to decide how to update the parameters so that the network learns. After using Dropout, the procedure becomes the following:
First, randomly (and temporarily) delete half of the hidden neurons in the network while keeping the input and output neurons unchanged (the dotted units in Figure 3 are the temporarily deleted neurons).
Then propagate the input x forward through the modified network and back-propagate the loss through the same modified network. After a small batch of training samples has gone through this process, update the parameters (w, b) of the neurons that were not deleted by stochastic gradient descent.
Then keep repeating the following process:
- Restore the deleted neurons (the deleted neurons keep their original parameters, while the neurons that were not deleted have been updated).
- Randomly select another half of the hidden neurons and temporarily delete them (backing up the parameters of the deleted neurons).
- For a small batch of training samples, run forward propagation, then back-propagate the loss and update the parameters (w, b) with stochastic gradient descent (only the parameters of the kept neurons are updated; the deleted neurons keep the values they had before being deleted).
This process is then repeated over and over again.
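The following is a minimal, self-contained sketch of this per-batch "delete, train, restore, re-delete" loop, using a hypothetical one-hidden-layer regression network and toy data of my own (an illustration, not the author's code):

import torch

torch.manual_seed(0)
# toy network: 1 input -> 20 hidden units -> 1 output
W1 = torch.randn(20, 1, requires_grad=True)
W2 = torch.randn(1, 20, requires_grad=True)
p_drop = 0.5        # probability of temporarily deleting a hidden unit
lr = 0.01

for step in range(100):
    x = torch.randn(16, 1)          # a small batch of inputs
    y = 1.5 * x                     # toy regression targets

    # (1) randomly (temporarily) delete about half of the hidden units
    mask = (torch.rand(20) > p_drop).float()

    # (2) forward through the thinned network, then back-propagate the loss
    h = torch.relu(x @ W1.t()) * mask    # deleted units output 0
    pred = h @ W2.t()
    loss = ((pred - y) ** 2).mean()
    loss.backward()

    # plain SGD update; the gradients of the deleted units are 0, so only
    # the surviving units' parameters actually change
    with torch.no_grad():
        W1 -= lr * W1.grad
        W2 -= lr * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
    # (3) at the next iteration a new mask is drawn: the previously deleted
    # units are restored and a different random subset is deleted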
2.2 Using Dropout in a neural network
The Dropout workflow has been described in detail above, but how do we make some neurons stop working (i.e., get deleted) with a certain probability? How is this implemented at the code level?
Below we go through some of the formula derivations and code-level implementation ideas behind Dropout.
(1) In the training phase
Inevitably, a probabilistic step has to be added to every unit of the network during training.
The corresponding formulas change as follows:
- Network computation without Dropout:
  $z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)}$, $\; y_i^{(l+1)} = f(z_i^{(l+1)})$
- Network computation with Dropout:
  $r_j^{(l)} \sim \mathrm{Bernoulli}(p)$, $\; \tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} * \mathbf{y}^{(l)}$, $\; z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}$, $\; y_i^{(l+1)} = f(z_i^{(l+1)})$
In the formulas above, the Bernoulli function generates the probability vector r, i.e., a randomly generated vector of 0s and 1s.
At the code level, making a neuron stop working with probability p simply means setting its activation value to 0 with probability p. For example, suppose one layer of our network has 1000 neurons whose activation outputs are y1, y2, y3, …, y1000, and we choose a dropout ratio of 0.4. After this layer passes through dropout, about 400 of the 1000 values will be set to 0.
Note: after masking out some neurons and setting their activations to 0, we also need to rescale the vector y1…y1000, i.e., multiply it by 1/(1-p). If during training we set values to 0 but do not rescale y1…y1000, then at test time we need to scale the weights instead, as described below.
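A minimal sketch of this masking-and-rescaling step (my own illustration of the "inverted dropout" variant, which rescales during training):

import torch

def dropout_train(y, p_drop):
    # inverted dropout: zero each activation with probability p_drop
    # and rescale the survivors by 1/(1 - p_drop)
    r = torch.bernoulli(torch.full_like(y, 1.0 - p_drop))  # keep-mask of 0s and 1s
    return y * r / (1.0 - p_drop)

torch.manual_seed(0)
y = torch.ones(1000)                 # activations y1 ... y1000 of one layer
out = dropout_train(y, p_drop=0.4)   # about 400 entries become 0
print((out == 0).sum().item())       # roughly 400
print(out.max().item())              # surviving entries become 1/(1-0.4) ≈ 1.667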
(2) In the test phase
When using the model for prediction, the weight of each unit is multiplied by the probability p (here p is the probability that the unit was kept during training).
The Dropout formula in the test phase:
$W^{l}_{test} = p W^{l}$
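As a small illustration of what this test-time correction means in code (my own sketch with a hypothetical linear layer, not from the original):

import torch
import torch.nn as nn

p_keep = 0.5                 # probability that a unit was kept during training
layer = nn.Linear(100, 10)

# "vanilla dropout" test-time correction: W_test = p * W
with torch.no_grad():
    layer.weight.mul_(p_keep)

# note: PyTorch's nn.Dropout already rescales by 1/(1-p) during training
# (inverted dropout), so this manual correction is not needed with nn.Dropout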
3. Why can Dropout alleviate overfitting?
An averaging (ensemble) effect:
Let's first go back to the standard model without dropout. If we train 5 different neural networks with the same training data, we will generally get 5 different results, and we can then decide the final answer by "averaging the 5 results" or by a "majority-vote strategy". For example, if 3 of the networks predict the digit 9, the true answer is very likely 9, and the other two networks gave wrong results. This "combine and average" strategy can effectively prevent overfitting, because different networks may overfit in different ways and averaging tends to make the "opposite" fits cancel each other out.
Dropping different hidden neurons is just like training different networks: randomly deleting half of the hidden neurons makes the network structure different each time, so the whole dropout procedure is equivalent to averaging over many different neural networks. Different networks overfit in different ways, and some of these "opposite" fits cancel each other, which reduces overfitting overall.
Reducing complex co-adaptations between neurons:
Because dropout ensures that two neurons do not always appear in the same thinned network, weight updates no longer rely on hidden units with fixed relationships acting together, which prevents situations where some features are only useful in the presence of other specific features. The network is forced to learn more robust features that are also useful in combination with random subsets of the other neurons. In other words, when our neural network makes a prediction, it should not be overly sensitive to particular clues; even if specific clues are lost, it should still be able to learn common features from many other clues. Seen this way, dropout is somewhat like L1/L2 regularization: shrinking the weights makes the network more robust to the loss of particular neuron connections.
Dropout plays a role similar to that of sex in biological evolution:
To survive, a species tends to adapt to its environment, but a sudden environmental change can make it hard for the species to respond in time. The emergence of sex makes it possible to breed variants that adapt to the new environment, effectively preventing overfitting, i.e., avoiding extinction when the environment changes.
4. Analysis of Dropout in PyTorch
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import sys, os
hello_pytorch_DIR = os.path.abspath(os.path.dirname(__file__)+os.path.sep+".."+os.path.sep+"..")
sys.path.append(hello_pytorch_DIR)
from PYTORCH.Deep_eye.Pytorch_Camp_master.hello_pytorch.tools.common_tools import set_seed
from torch.utils.tensorboard import SummaryWriter
set_seed(1) # Set random seeds
n_hidden = 200
max_iter = 2000
disp_interval = 400
lr_init = 0.01
# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y
train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))
# ============================ step 2/5 Model ============================
# Dropout is usually not added before the output layer
class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),

            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),

            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),

            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)
net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)    # d_prob=0.: no Dropout
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)  # Dropout probability 0.5
# ============================ step 3/5 Optimizer ============================
optim_normal = torch.optim.SGD(net_prob_0.parameters(), lr=lr_init, momentum=0.9)
optim_reglar = torch.optim.SGD(net_prob_05.parameters(), lr=lr_init, momentum=0.9)
# ============================ step 4/5 Loss function ============================
loss_func = torch.nn.MSELoss()
# ============================ step 5/5 Iterative training ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):

    pred_normal, pred_wdecay = net_prob_0(train_x), net_prob_05(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_reglar.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_reglar.step()

    if (epoch+1) % disp_interval == 0:

        net_prob_0.eval()   # switch the nets to eval mode (Dropout becomes a no-op)
        net_prob_05.eval()

        # visualization
        for name, layer in net_prob_0.named_parameters():
            writer.add_histogram(name + '_grad_normal', layer.grad, epoch)
            writer.add_histogram(name + '_data_normal', layer, epoch)

        for name, layer in net_prob_05.named_parameters():
            writer.add_histogram(name + '_grad_regularization', layer.grad, epoch)
            writer.add_histogram(name + '_data_regularization', layer, epoch)

        test_pred_prob_0, test_pred_prob_05 = net_prob_0(test_x), net_prob_05(test_x)

        # plotting
        plt.clf()
        plt.scatter(train_x.data.numpy(), train_y.data.numpy(), c='blue', s=50, alpha=0.3, label='train')
        plt.scatter(test_x.data.numpy(), test_y.data.numpy(), c='red', s=50, alpha=0.3, label='test')
        plt.plot(test_x.data.numpy(), test_pred_prob_0.data.numpy(), 'r-', lw=3, label='d_prob_0')
        plt.plot(test_x.data.numpy(), test_pred_prob_05.data.numpy(), 'b--', lw=3, label='d_prob_05')
        plt.text(-0.25, -1.5, 'd_prob_0 loss={:.8f}'.format(loss_normal.item()), fontdict={'size': 15, 'color': 'red'})
        plt.text(-0.25, -2, 'd_prob_05 loss={:.6f}'.format(loss_wdecay.item()), fontdict={'size': 15, 'color': 'red'})
        plt.ylim((-2.5, 2.5))
        plt.legend(loc='upper left')
        plt.title("Epoch: {}".format(epoch+1))
        plt.show()
        plt.close()

        net_prob_0.train()  # switch back to training mode so Dropout is active again
        net_prob_05.train()
The loss of the red line is very low; the model has completely fit (overfit) the training data.
The blue curve, with Dropout added, clearly reduces overfitting and is smoother.
Dropout controls the scale of the weights (a weight-shrinking effect):
Dropout has the effect of shrinking the squared norm of the weights, similar to L2 regularization: applying dropout compresses the weights and performs some additional regularization that helps prevent overfitting.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import sys, os
hello_pytorch_DIR = os.path.abspath(os.path.dirname(__file__)+os.path.sep+".."+os.path.sep+"..")
sys.path.append(hello_pytorch_DIR)
from PYTORCH.Deep_eye.Pytorch_Camp_master.hello_pytorch.tools.common_tools import set_seed
from torch.utils.tensorboard import SummaryWriter
# set_seed(1) # Set random seeds
class Net(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(Net, self).__init__()
        self.linears = nn.Sequential(
            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1, bias=False),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.linears(x)
input_num = 10000
x = torch.ones((input_num, ), dtype=torch.float32)
net = Net(input_num, d_prob=0.5)
net.linears[1].weight.detach().fill_(1.)
net.train()
y = net(x)
print("output in training mode", y)
net.eval()
y = net(x)
print("output in eval mode", y)
OUT:
output in training mode tensor([9982.], grad_fn=<ReluBackward0>)
output in eval mode tensor([10000.], grad_fn=<ReluBackward0>)
Note: Keras implements Dropout in this "inverted" way as well: it masks out some neurons by setting their activation values to 0, and then scales the remaining activation vector x1…x1000 up by multiplying it by 1/(1-p). PyTorch's nn.Dropout behaves the same, which is why the training-mode output above (about 4991 of the 10000 inputs survive, each scaled by 1/(1-0.5) = 2, giving 9982) is close to the eval-mode output of 10000.
Reflection:
Above we introduced two ways of rescaling for Dropout. So why does Dropout need rescaling at all?
Because we randomly discard some neurons during training, but we cannot keep discarding them randomly at test time. If neurons were also dropped at test time, the results would be unstable: given the same test input, the model might sometimes output a and sometimes output b. Such instability is unacceptable for a real system; users would conclude that the model's predictions are unreliable. One way to "compensate" is to multiply the weight of each neuron by p, so that "on the whole" the test-time activations have roughly the same scale as the training-time ones. For example, if a neuron's output is x, then during training it participates with probability p and is discarded with probability (1-p), so the expectation of its output is px + (1-p)·0 = px. Multiplying this neuron's weight by p at test time therefore gives the same expectation.
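As a quick numerical sanity check of this expectation argument (my own sketch, not from the original):

import torch

torch.manual_seed(0)
p = 0.8                 # probability that the neuron participates (is kept)
x = 3.0                 # a fixed activation value

# training time: the output is x with probability p and 0 with probability 1-p
samples = torch.bernoulli(torch.full((100000,), p)) * x
print(samples.mean().item())   # close to p * x = 2.4

# test time: deterministic output with the weight scaled by p
print(p * x)                   # 2.4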
Summary:
At present, Dropout is used heavily in fully connected networks, with the dropout probability typically set to 0.5 or 0.3. In the hidden layers of convolutional networks, because convolution itself is sparse and the sparsifying ReLU activation is used extensively, the Dropout strategy is applied much less often. Overall, Dropout is a hyperparameter that should be tried and tuned for the specific network and application.