
Methods to speed up deep learning training and tricks to reduce GPU memory usage (entry-level tricks that do not require much compute)

2022-06-11 07:47:00 Life is a joke

1、Low GPU utilization: increase the batch size until GPU memory is close to full
Programs often run very slowly while the reported GPU utilization is only 3% or 10%. This usually happens because the CPU and GPU speeds are poorly matched. Since the model's computation runs almost entirely on the GPU, the usual culprit is slow data loading: if preparing a batch takes a long time while the model finishes its computation quickly, the GPU sits idle most of the time and utilization stays low. Of course, the data pipeline is not always to blame; the model itself may simply be too simple.

Make batch_size larger. This loads more data into GPU memory at once, raises utilization, and tries to fill up the GPU's memory.
Raise num_workers in the DataLoader. This parameter loads data with multiple worker processes to improve efficiency; typical values are 4, 8, or 16. However, more is not always better: with too many workers, the scheduling and coordination between processes plus the extra I/O can actually slow things down. A minimal sketch follows.
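A minimal sketch, assuming a PyTorch Dataset object named train_dataset already exists; the batch size and worker count are illustrative and should be tuned on your machine:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,        # an existing torch.utils.data.Dataset (assumed)
    batch_size=256,       # raise this until GPU memory is nearly full
    shuffle=True,
    num_workers=8,        # try 4 / 8 / 16 and measure which is fastest
    pin_memory=True,      # usually speeds up CPU-to-GPU transfers
)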

2、Set torch.backends.cudnn.benchmark = True at the start of the program. The program spends a little extra time at startup searching for the fastest convolution algorithm for each convolution layer of the network, and then the network runs faster. It is suitable when the network structure is fixed (not dynamic) and the input shape of the network (batch size, image size, number of input channels) stays the same, which is the common case. Conversely, if the convolution configurations keep changing, the program keeps re-tuning and ends up spending more time.
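A minimal sketch; the flag only needs to be set once, before training starts, and only helps when the input shapes stay fixed:

import torch

torch.backends.cudnn.benchmark = True   # let cuDNN search for the fastest convolution algorithms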

3、Call torch.cuda.empty_cache() while the program runs.
Reason: when PyTorch is used from Python, allocated GPU memory is not always released automatically. In that case, delete the variables that are no longer needed and then release the cached memory, as in the sketch below.
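A minimal sketch, where big_tensor stands for any intermediate result you no longer need:

import torch

big_tensor = torch.randn(4096, 4096, device="cuda")   # illustrative placeholder
# ... use big_tensor ...
del big_tensor              # drop the last Python reference to the tensor
torch.cuda.empty_cache()    # release cached GPU memory held by PyTorch's allocator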

4、Optimize the network structure: when the network is too complex, for example with too many convolution layers, training slows down.

5、By default the whole network uses 32-bit floating point numbers; switching to 16-bit roughly halves the GPU memory used.
6、Use gradient accumulation: divide the loss by n, i.e. loss = loss / n, and update the parameters only every n steps (see the detailed gradient accumulation section below).

7、Choose a smaller data type

By default the whole network uses 32-bit floating point numbers; switching to 16-bit (half precision, usually via automatic mixed precision) roughly halves the GPU memory used. A sketch is shown below.
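A minimal sketch using PyTorch's automatic mixed precision (torch.cuda.amp); model, criterion, optimizer and train_loader are assumed to already exist:

import torch

scaler = torch.cuda.amp.GradScaler()
for features, target in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in float16 where it is safe
        outputs = model(features)
        loss = criterion(outputs, target)
    scaler.scale(loss).backward()          # scale the loss so small gradients do not underflow
    scaler.step(optimizer)
    scaler.update()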

8、Simplify the model

When designing the model, shrink it where appropriate: for example, turn a two-layer LSTM into a single layer, replace LSTM with GRU, reduce the number of convolution layers, and use as few Linear layers as possible. An illustrative before/after sketch follows.
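An illustrative sketch of such trimming; the layer sizes are placeholders:

import torch.nn as nn

# before: a two-layer LSTM encoder
encoder_big = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)

# after: a single-layer GRU, noticeably cheaper in parameters, memory and compute
encoder_small = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)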

9、The data angle

For text data, the cost of a long sequence grows at least linearly with its length, so reducing the sequence length appropriately can greatly cut memory and computation. A minimal truncation sketch follows.
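A minimal sketch of capping the sequence length; MAX_LEN, pad_id and the token-id lists are illustrative:

MAX_LEN = 128   # illustrative cap on sequence length

def truncate_or_pad(token_ids, pad_id=0, max_len=MAX_LEN):
    token_ids = token_ids[:max_len]                            # drop tokens beyond max_len
    return token_ids + [pad_id] * (max_len - len(token_ids))   # pad short sequences to max_len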

10、Accumulating total_loss

Since loss is a tensor that still carries its computation graph and gradient information, summing it directly keeps every batch's graph alive in GPU memory. The correct way to accumulate the loss is:

total_loss += loss.item()
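A minimal sketch of the difference inside a training loop (model, criterion and train_loader are assumed to exist):

total_loss = 0.0
for features, target in train_loader:
    loss = criterion(model(features), target)
    # total_loss += loss          # wrong: keeps every batch's computation graph alive in GPU memory
    total_loss += loss.item()     # right: converts to a plain Python float, so the graph can be freed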
11、Release tensors and variables you no longer need

Use del to release tensors and variables that are no longer needed (the sketch in section 3 above shows the pattern together with torch.cuda.empty_cache()). This also means being disciplined about variables when writing the model: do not create temporary tensors all over the place on a whim.

12、The inplace parameter of ReLU

The activation function ReLU() has a parameter inplace, which defaults to False. When it is set to True, the values computed by relu() do not occupy new memory but directly overwrite the input, so setting inplace=True saves some GPU memory, as in the sketch below.
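A minimal sketch; the convolution layer is only there to give the ReLU an activation map to overwrite:

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),   # overwrites the convolution output in place instead of allocating a new tensor
)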

13、Gradient accumulation

First, some basic PyTorch background:

In PyTorch, when we call loss.backward(), the gradient is computed for each parameter and stored in parameter.grad. Note that parameter.grad is a tensor, and it accumulates the gradients from every backward call.
In PyTorch, the network parameters are only updated by gradient descent when optimizer.step() is called.
We know that batch size is closely tied to GPU memory usage, but sometimes the batch size cannot be set too small either. What can we do then?

The answer is gradient accumulation.

Let's first look at conventional training:

for i, (feature, target) in enumerate(train_loader):
    outputs = model(feature)            # forward pass
    loss = criterion(outputs, target)   # compute the loss

    optimizer.zero_grad()               # clear the gradients
    loss.backward()                     # backward pass, compute the gradients
    optimizer.step()                    # update the network parameters

After adding gradient accumulation, the code looks like this:

for i, (features, target) in enumerate(train_loader):
    outputs = model(features)               # forward pass
    loss = criterion(outputs, target)       # compute the loss
    loss = loss / accumulation_steps        # optional, if the loss should be averaged over the accumulated batches

    loss.backward()                         # backward pass, accumulate the gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                    # update the network parameters with the accumulated gradients
        optimizer.zero_grad()               # clear the gradients

Comparing the two, we can see that gradient accumulation essentially accumulates the gradients of accumulation_steps mini-batches (each loss divided by accumulation_steps) and only then updates the network parameters with the accumulated gradient, which approximates training with a larger real batch_size. When using it, the learning rate may need to be scaled up appropriately.

More concretely, suppose batch_size = 4 and accumulation_steps = 8. Gradient accumulation runs the forward and backward passes with batch_size = 4 to compute gradients, but does not update the parameters; the gradients keep accumulating until accumulation_steps batches have been processed, and only then are the parameters updated. In essence this is equivalent to:

real batch_size = batch_size * accumulation_steps
Gradient accumulation can greatly alleviate the problem of insufficient GPU memory and is recommended.

Taking PyTorch as an example, the training process of a neural network usually looks like this:

for i, (inputs, labels) in enumerate(trainloader):
    optimizer.zero_grad()                   # clear the gradients
    outputs = net(inputs)                   # forward pass
    loss = criterion(outputs, labels)       # compute the loss
    loss.backward()                         # backward pass, compute the gradients
    optimizer.step()                        # update the parameters
    if (i + 1) % evaluation_steps == 0:
        evaluate_model()

From the code we can clearly see how the network is trained:
1. Clear the gradients computed for the previous batch
2. Forward pass: feed the data through the network and get the predictions
3. Compute the loss from the predictions and the labels
4. Run backpropagation on the loss to compute the parameter gradients
5. Update the network parameters with the computed gradients

Now let's look at how gradient accumulation is done:

for i, (inputs, labels) in enumerate(trainloader):
    outputs = net(inputs)                   # forward pass
    loss = criterion(outputs, labels)       # compute the loss
    loss = loss / accumulation_steps        # normalize the loss over the accumulated batches
    loss.backward()                         # backward pass, accumulate the gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                    # update the parameters
        optimizer.zero_grad()               # clear the gradients
        if (i + 1) % evaluation_steps == 0:
            evaluate_model()

1. Forward pass: feed the data through the network and get the predictions
2. Compute the loss from the predictions and the labels
3. Run backpropagation on the loss to compute the parameter gradients
4. Repeat steps 1-3 without clearing the gradients, letting them accumulate
5. Once the gradients have accumulated for a fixed number of steps, update the parameters and then clear the gradients
To sum up, gradient accumulation computes the gradient for each batch without resetting it, accumulates the gradients, updates the network parameters once a set number of steps has been reached, and only then clears the gradients.

By delaying the parameter updates in this way, we can achieve an effect similar to using a larger batch size. In my own experiments I usually use gradient accumulation, and most of the time a model trained with gradient accumulation performs much better than one trained with the small batch size alone.

The BERT repository uses exactly this trick. It is very practical, a real blessing for those of us who are short on compute.

14、Gradient checkpointing

I have never used this trick myself; after all, my models are not that large. A sketch is included below for reference.
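A minimal sketch using torch.utils.checkpoint; the two small blocks and tensor sizes are purely illustrative. Activations inside the checkpointed segment are not stored during the forward pass and are recomputed during backward, trading compute for GPU memory:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

layer1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
layer2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

x = torch.randn(32, 1024, requires_grad=True)
h = checkpoint(layer1, x)    # layer1's intermediate activations are not kept
out = layer2(h).sum()
out.backward()               # layer1 is re-run here to recompute what backward needs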

Note: increasing the batch size reduces the number of optimization steps per epoch, so convergence measured in epochs may slow down and it can take longer overall to converge (for example, when batch_size equals the total number of samples).


Copyright notice: this article was written by [Life is a joke]; please include a link to the original when reposting: https://yzsam.com/2022/03/202203020517457848.html