Dropout: random deactivation
2022-06-30 18:08:00 【*Yuanzai】
1. A brief introduction to Dropout
1.1 Why Dropout was proposed
In machine learning, if a model has too many parameters while the training samples are too few, the trained model easily overfits. Overfitting is frequently encountered when training neural networks, and it shows up as follows: the model has a small loss and high prediction accuracy on the training data, but a large loss and low prediction accuracy on the test data.
Overfitting is a common problem in many machine learning tasks. If a model overfits, it is almost unusable. To deal with overfitting, model ensembling is usually adopted, i.e., training multiple models and combining them. The training time then becomes a big problem: it is time-consuming not only to train multiple models but also to evaluate them.
In summary, training deep neural networks has two big drawbacks:
- It is easy to overfit
- It is time consuming
Dropout can effectively alleviate overfitting and, to a certain extent, achieves a regularization effect.
1.2 What is Dropout?
In 2012, Hinton proposed Dropout in the paper 《Improving neural networks by preventing co-adaptation of feature detectors》. When a complex feedforward neural network is trained on a small dataset, it easily overfits. To prevent overfitting, the performance of the network can be improved by preventing the co-adaptation of feature detectors.
Also in 2012, Alex Krizhevsky and Hinton used Dropout in the paper 《ImageNet Classification with Deep Convolutional Neural Networks》 to prevent overfitting. The AlexNet model introduced in that paper set off a wave of neural-network applications, won the 2012 ImageNet image-recognition competition, and made CNNs the core algorithm for image classification.
After that, several more papers about Dropout appeared: 《Dropout: A Simple Way to Prevent Neural Networks from Overfitting》, 《Improving Neural Networks with Dropout》 and 《Dropout as data augmentation》.
From the papers above we can feel how important Dropout is in deep learning. So, what exactly is Dropout?
Dropout can be used as a trick when training deep neural networks. In each training batch, ignoring half of the feature detectors (setting half of the hidden units to 0) noticeably reduces overfitting. It reduces the co-adaptation between feature detectors (hidden units), where co-adaptation means that some detectors only work when other specific detectors are present.
Put simply, Dropout means that during forward propagation each neuron's activation is made to stop working (set to 0) with a certain probability p. This makes the model generalize better, because it no longer relies too heavily on particular local features, as shown in Figure 1.
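Before going into the details, here is a minimal PyTorch sketch (my own illustration, not part of the original text) of what "letting an activation stop working with probability p" looks like with the built-in nn.Dropout layer:

import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # each activation is zeroed with probability p=0.5
x = torch.ones(1, 10)      # a toy activation vector

drop.train()               # training mode: random zeroing plus 1/(1-p) rescaling
print(drop(x))             # roughly half the entries are 0, the rest are 2.0

drop.eval()                # eval mode: Dropout is a no-op
print(drop(x))             # all entries stay 1.0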
2. Dropout workflow and usage
2.1 The Dropout workflow in detail
Suppose we are training a neural network like the one shown in Figure 2.
The input is x and the output is y. The normal procedure is: first propagate x forward through the network, then back-propagate the error to decide how to update the parameters so that the network learns. After using Dropout, the procedure becomes the following:
First, randomly (and temporarily) delete half of the hidden neurons in the network while keeping the input and output neurons unchanged (the dotted units in Figure 3 are the temporarily deleted neurons).
Then propagate the input x forward through the modified network and back-propagate the loss through the same modified network. After a small batch of training samples has gone through this process, update the parameters (w, b) of the neurons that were not deleted by stochastic gradient descent.
Then keep repeating the following process:
- Restore the deleted neurons (the deleted neurons keep their original parameters, while the neurons that were not deleted have been updated).
- Randomly select another half of the hidden neurons and temporarily delete them (backing up the parameters of the deleted neurons).
- For a small batch of training samples, run forward propagation, then back-propagate the loss and update the parameters (w, b) with stochastic gradient descent (only the parameters of the kept neurons are updated; the deleted neurons keep the values they had before being deleted).
This process is then repeated over and over again.
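The following is a minimal, self-contained sketch of this per-batch "delete, train, restore, re-delete" loop, using a hypothetical one-hidden-layer regression network and toy data of my own (an illustration, not the author's code):

import torch

torch.manual_seed(0)
# toy network: 1 input -> 20 hidden units -> 1 output
W1 = torch.randn(20, 1, requires_grad=True)
W2 = torch.randn(1, 20, requires_grad=True)
p_drop = 0.5        # probability of temporarily deleting a hidden unit
lr = 0.01

for step in range(100):
    x = torch.randn(16, 1)          # a small batch of inputs
    y = 1.5 * x                     # toy regression targets

    # (1) randomly (temporarily) delete about half of the hidden units
    mask = (torch.rand(20) > p_drop).float()

    # (2) forward through the thinned network, then back-propagate the loss
    h = torch.relu(x @ W1.t()) * mask    # deleted units output 0
    pred = h @ W2.t()
    loss = ((pred - y) ** 2).mean()
    loss.backward()

    # plain SGD update; the gradients of the deleted units are 0, so only
    # the surviving units' parameters actually change
    with torch.no_grad():
        W1 -= lr * W1.grad
        W2 -= lr * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
    # (3) at the next iteration a new mask is drawn: the previously deleted
    # units are restored and a different random subset is deleted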
2.2 Using Dropout in a neural network
The Dropout workflow has been described in detail above, but how do we make some neurons stop working (i.e., get deleted) with a certain probability? How is this implemented at the code level?
Below we go through some of the formula derivations and code-level implementation ideas behind Dropout.
(1) In the training phase
Inevitably, a probabilistic step has to be added to every unit of the network during training.
The corresponding formulas change as follows:
- Network computation without Dropout:
  $z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)}$, $\; y_i^{(l+1)} = f(z_i^{(l+1)})$
- Network computation with Dropout:
  $r_j^{(l)} \sim \mathrm{Bernoulli}(p)$, $\; \tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} * \mathbf{y}^{(l)}$, $\; z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}$, $\; y_i^{(l+1)} = f(z_i^{(l+1)})$
In the formulas above, the Bernoulli function generates the probability vector r, i.e., a randomly generated vector of 0s and 1s.
At the code level, making a neuron stop working with probability p simply means setting its activation value to 0 with probability p. For example, suppose one layer of our network has 1000 neurons whose activation outputs are y1, y2, y3, …, y1000, and we choose a dropout ratio of 0.4. After this layer passes through dropout, about 400 of the 1000 values will be set to 0.
Note: after masking out some neurons and setting their activations to 0, we also need to rescale the vector y1…y1000, i.e., multiply it by 1/(1-p). If during training we set values to 0 but do not rescale y1…y1000, then at test time we need to scale the weights instead, as described below.
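A minimal sketch of this masking-and-rescaling step (my own illustration of the "inverted dropout" variant, which rescales during training):

import torch

def dropout_train(y, p_drop):
    # inverted dropout: zero each activation with probability p_drop
    # and rescale the survivors by 1/(1 - p_drop)
    r = torch.bernoulli(torch.full_like(y, 1.0 - p_drop))  # keep-mask of 0s and 1s
    return y * r / (1.0 - p_drop)

torch.manual_seed(0)
y = torch.ones(1000)                 # activations y1 ... y1000 of one layer
out = dropout_train(y, p_drop=0.4)   # about 400 entries become 0
print((out == 0).sum().item())       # roughly 400
print(out.max().item())              # surviving entries become 1/(1-0.4) ≈ 1.667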
(2) In the test phase
When using the model for prediction, the weight of each unit is multiplied by the probability p (here p is the probability that the unit was kept during training).
The Dropout formula in the test phase:
$W^{l}_{test} = p W^{l}$
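As a small illustration of what this test-time correction means in code (my own sketch with a hypothetical linear layer, not from the original):

import torch
import torch.nn as nn

p_keep = 0.5                 # probability that a unit was kept during training
layer = nn.Linear(100, 10)

# "vanilla dropout" test-time correction: W_test = p * W
with torch.no_grad():
    layer.weight.mul_(p_keep)

# note: PyTorch's nn.Dropout already rescales by 1/(1-p) during training
# (inverted dropout), so this manual correction is not needed with nn.Dropout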
3. Why can Dropout alleviate overfitting?
An averaging (ensemble) effect:
Let's first go back to the standard model without dropout. If we train 5 different neural networks with the same training data, we will generally get 5 different results, and we can then decide the final answer by "averaging the 5 results" or by a "majority-vote strategy". For example, if 3 of the networks predict the digit 9, the true answer is very likely 9, and the other two networks gave wrong results. This "combine and average" strategy can effectively prevent overfitting, because different networks may overfit in different ways and averaging tends to make the "opposite" fits cancel each other out.
Dropping different hidden neurons is just like training different networks: randomly deleting half of the hidden neurons makes the network structure different each time, so the whole dropout procedure is equivalent to averaging over many different neural networks. Different networks overfit in different ways, and some of these "opposite" fits cancel each other, which reduces overfitting overall.
Reducing complex co-adaptations between neurons:
Because dropout ensures that two neurons do not always appear in the same thinned network, weight updates no longer rely on hidden units with fixed relationships acting together, which prevents situations where some features are only useful in the presence of other specific features. The network is forced to learn more robust features that are also useful in combination with random subsets of the other neurons. In other words, when our neural network makes a prediction, it should not be overly sensitive to particular clues; even if specific clues are lost, it should still be able to learn common features from many other clues. Seen this way, dropout is somewhat like L1/L2 regularization: shrinking the weights makes the network more robust to the loss of particular neuron connections.
Dropout plays a role similar to that of sex in biological evolution:
To survive, a species tends to adapt to its environment, but a sudden environmental change can make it hard for the species to respond in time. The emergence of sex makes it possible to breed variants that adapt to the new environment, effectively preventing overfitting, i.e., avoiding extinction when the environment changes.
4. Analysis of Dropout in PyTorch
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import sys, os
hello_pytorch_DIR = os.path.abspath(os.path.dirname(__file__)+os.path.sep+".."+os.path.sep+"..")
sys.path.append(hello_pytorch_DIR)
from PYTORCH.Deep_eye.Pytorch_Camp_master.hello_pytorch.tools.common_tools import set_seed
from torch.utils.tensorboard import SummaryWriter
set_seed(1) # Set random seeds
n_hidden = 200
max_iter = 2000
disp_interval = 400
lr_init = 0.01
# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y
train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))
# ============================ step 2/5 Model ============================
# Dropout is usually not added before the output layer
class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),

            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),

            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),

            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)
net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)    # d_prob=0.: no Dropout
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)  # Dropout probability 0.5
# ============================ step 3/5 Optimizer ============================
optim_normal = torch.optim.SGD(net_prob_0.parameters(), lr=lr_init, momentum=0.9)
optim_reglar = torch.optim.SGD(net_prob_05.parameters(), lr=lr_init, momentum=0.9)
# ============================ step 4/5 Loss function ============================
loss_func = torch.nn.MSELoss()
# ============================ step 5/5 Iterative training ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):

    pred_normal, pred_wdecay = net_prob_0(train_x), net_prob_05(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_reglar.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_reglar.step()

    if (epoch+1) % disp_interval == 0:

        net_prob_0.eval()   # switch the nets to eval mode (Dropout becomes a no-op)
        net_prob_05.eval()

        # visualization
        for name, layer in net_prob_0.named_parameters():
            writer.add_histogram(name + '_grad_normal', layer.grad, epoch)
            writer.add_histogram(name + '_data_normal', layer, epoch)

        for name, layer in net_prob_05.named_parameters():
            writer.add_histogram(name + '_grad_regularization', layer.grad, epoch)
            writer.add_histogram(name + '_data_regularization', layer, epoch)

        test_pred_prob_0, test_pred_prob_05 = net_prob_0(test_x), net_prob_05(test_x)

        # plotting
        plt.clf()
        plt.scatter(train_x.data.numpy(), train_y.data.numpy(), c='blue', s=50, alpha=0.3, label='train')
        plt.scatter(test_x.data.numpy(), test_y.data.numpy(), c='red', s=50, alpha=0.3, label='test')
        plt.plot(test_x.data.numpy(), test_pred_prob_0.data.numpy(), 'r-', lw=3, label='d_prob_0')
        plt.plot(test_x.data.numpy(), test_pred_prob_05.data.numpy(), 'b--', lw=3, label='d_prob_05')
        plt.text(-0.25, -1.5, 'd_prob_0 loss={:.8f}'.format(loss_normal.item()), fontdict={'size': 15, 'color': 'red'})
        plt.text(-0.25, -2, 'd_prob_05 loss={:.6f}'.format(loss_wdecay.item()), fontdict={'size': 15, 'color': 'red'})
        plt.ylim((-2.5, 2.5))
        plt.legend(loc='upper left')
        plt.title("Epoch: {}".format(epoch+1))
        plt.show()
        plt.close()

        net_prob_0.train()  # switch back to training mode so Dropout is active again
        net_prob_05.train()
The loss of the red line is very low; the model has completely fit (overfit) the training data.
The blue curve, with Dropout added, clearly reduces overfitting and is smoother.
Dropout controls the scale of the weights (a weight-shrinking effect):
Dropout has the effect of shrinking the squared norm of the weights, similar to L2 regularization: applying dropout compresses the weights and performs some additional regularization that helps prevent overfitting.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import sys, os
hello_pytorch_DIR = os.path.abspath(os.path.dirname(__file__)+os.path.sep+".."+os.path.sep+"..")
sys.path.append(hello_pytorch_DIR)
from PYTORCH.Deep_eye.Pytorch_Camp_master.hello_pytorch.tools.common_tools import set_seed
from torch.utils.tensorboard import SummaryWriter
# set_seed(1) # Set random seeds
class Net(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(Net, self).__init__()
        self.linears = nn.Sequential(
            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1, bias=False),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.linears(x)
input_num = 10000
x = torch.ones((input_num, ), dtype=torch.float32)
net = Net(input_num, d_prob=0.5)
net.linears[1].weight.detach().fill_(1.)
net.train()
y = net(x)
print("output in training mode", y)
net.eval()
y = net(x)
print("output in eval mode", y)
OUT:
output in training mode tensor([9982.], grad_fn=<ReluBackward0>)
output in eval mode tensor([10000.], grad_fn=<ReluBackward0>)
Note: Keras implements Dropout in this "inverted" way as well: it masks out some neurons by setting their activation values to 0, and then scales the remaining activation vector x1…x1000 up by multiplying it by 1/(1-p). PyTorch's nn.Dropout behaves the same, which is why the training-mode output above (about 4991 of the 10000 inputs survive, each scaled by 1/(1-0.5) = 2, giving 9982) is close to the eval-mode output of 10000.
Reflection:
Above we introduced two ways of rescaling for Dropout. So why does Dropout need rescaling at all?
Because we randomly discard some neurons during training, but we cannot keep discarding them randomly at test time. If neurons were also dropped at test time, the results would be unstable: given the same test input, the model might sometimes output a and sometimes output b. Such instability is unacceptable for a real system; users would conclude that the model's predictions are unreliable. One way to "compensate" is to multiply the weight of each neuron by p, so that "on the whole" the test-time activations have roughly the same scale as the training-time ones. For example, if a neuron's output is x, then during training it participates with probability p and is discarded with probability (1-p), so the expectation of its output is px + (1-p)·0 = px. Multiplying this neuron's weight by p at test time therefore gives the same expectation.
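As a quick numerical sanity check of this expectation argument (my own sketch, not from the original):

import torch

torch.manual_seed(0)
p = 0.8                 # probability that the neuron participates (is kept)
x = 3.0                 # a fixed activation value

# training time: the output is x with probability p and 0 with probability 1-p
samples = torch.bernoulli(torch.full((100000,), p)) * x
print(samples.mean().item())   # close to p * x = 2.4

# test time: deterministic output with the weight scaled by p
print(p * x)                   # 2.4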
Summary:
At present, Dropout is used heavily in fully connected networks, with the dropout probability typically set to 0.5 or 0.3. In the hidden layers of convolutional networks, because convolution itself is sparse and the sparsifying ReLU activation is used extensively, the Dropout strategy is applied much less often. Overall, Dropout is a hyperparameter that should be tried and tuned for the specific network and application.