Methods to improve training speed and reduce video memory usage in deep learning (entry-level tricks for when you do not have much compute)
2022-06-11 07:47:00 【Life is a joke】
1、Increase batch_size until GPU memory is nearly full: fixing low GPU utilization
A program often runs very slowly while GPU utilization sits at only 3% or 10%. This usually means the CPU and GPU speeds are poorly matched. Since the model's computation runs almost entirely on the GPU, the common culprit is slow data loading: if a batch takes a long time to load but the model finishes computing on it quickly, the GPU spends most of its time waiting, so utilization naturally stays low. Of course, it may not be a data problem at all; the model itself may simply be too simple.
Increase batch_size. This loads more data into video memory at once, which raises utilization and tries to fill up the GPU memory.
Set num_workers in the DataLoader. This parameter enables multi-process data loading to improve efficiency; common values are 4, 8 or 16. However, more is not always better: with too many workers, inter-process scheduling plus I/O overhead can actually slow things down. A minimal DataLoader sketch is shown below.
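A minimal sketch of such a DataLoader (train_dataset is assumed to be an existing torch Dataset; the numbers are only illustrative):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=64,      # raise until GPU memory is close to full
    shuffle=True,
    num_workers=8,      # multi-process loading; try 4 / 8 / 16, more is not always faster
    pin_memory=True,    # speeds up host-to-GPU transfer when training on CUDA
)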
2、Set torch.backends.cudnn.benchmark=True at the start of the program. The program spends a little extra time at startup searching for the most suitable convolution algorithm for every convolution layer in the network, and the network then runs faster. The applicable scenario is a fixed (non-dynamic) network structure whose input shapes (batch size, image size, input channels) stay the same, which in practice covers most cases. Conversely, if the convolution configurations keep changing, the program will keep re-optimizing and waste time.
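For example, placed once at the top of the training script:

import torch

torch.backends.cudnn.benchmark = True   # only helps when the network and input shapes stay fixed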
3、Call torch.cuda.empty_cache() before running the program.
Reason: when PyTorch is called from Python, video memory and GPU usage may not be released automatically. In that case, add code like the snippet below and delete the variables that are no longer needed.
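A minimal sketch of manually freeing cached GPU memory (big_tensor is a made-up name standing in for whatever is no longer needed; it requires a CUDA device):

import torch

big_tensor = torch.randn(4096, 4096, device="cuda")   # some intermediate result
del big_tensor                                         # drop the Python reference
torch.cuda.empty_cache()                               # hand the cached blocks back to the GPU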
4、Optimize the network structure: when the network is too complex, with too many convolution layers, training slows down.
5、By default the whole network uses 32-bit floating point; switching to 16-bit floating point cuts the video memory usage by a large factor.
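One common way to do this in PyTorch is automatic mixed precision (torch.cuda.amp); a minimal sketch, assuming model, optimizer, criterion and train_loader already exist as in the training loops later in this post:

import torch

scaler = torch.cuda.amp.GradScaler()
for features, target in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run the forward pass in float16 where it is safe
        outputs = model(features)
        loss = criterion(outputs, target)
    scaler.scale(loss).backward()         # scale the loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()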
6、Do gradient accumulation: divide the loss by n, i.e. loss = loss / n, and only perform the parameter update every n-th step (full code is given in the gradient accumulation section below).
7、Choose a smaller data type
As noted in point 5, the whole network uses 32-bit floating point by default; switching to 16-bit floating point cuts the video memory usage by a large factor.
8、Simplify the model
When designing the model, simplify it where appropriate: for example, turn a two-layer LSTM into a single layer, replace an LSTM with a GRU, reduce the number of convolutions, use as few Linear layers as possible, and so on. A small sketch is shown below.
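A minimal sketch of this kind of simplification (the layer sizes are made up for illustration):

import torch.nn as nn

# before: a heavier recurrent encoder
encoder = nn.LSTM(input_size=300, hidden_size=512, num_layers=2, batch_first=True)

# after: a lighter alternative with fewer parameters
encoder = nn.GRU(input_size=300, hidden_size=512, num_layers=1, batch_first=True)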
9、From the data side
For text data, the cost brought by long sequences grows at least linearly with sequence length, so appropriately reducing the sequence length can greatly cut memory usage (see the truncation sketch below).
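A minimal sketch of truncating token sequences before padding (max_len and the example sequences are made up for illustration):

max_len = 128
batch_token_ids = [list(range(300)), list(range(50))]          # stand-in tokenised samples
batch_token_ids = [ids[:max_len] for ids in batch_token_ids]   # keep at most max_len tokens each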
10、Accumulating total_loss
Since loss is a tensor that carries gradient information (it is still attached to the computation graph), the correct way to accumulate the running loss is:
total_loss += loss.item()
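A minimal sketch contrasting the two ways of accumulating the running loss (model, criterion, optimizer and train_loader are assumed to exist, as in the training loops below):

total_loss = 0.0
for features, target in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(features), target)
    loss.backward()
    optimizer.step()
    # total_loss += loss        # wrong: keeps every batch's computation graph alive in memory
    total_loss += loss.item()   # right: .item() extracts a plain Python float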
Release the tensors and variables you don't need
Use del to release tensors and variables you no longer need. This also means being disciplined about variables when writing the model: do not create them haphazardly all over the place.
11、The inplace parameter of ReLU
The activation function ReLU() has an inplace parameter which defaults to False. When it is set to True, the new value computed by relu() does not occupy new memory but directly overwrites the input, so setting it to True saves some video memory.
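For example, a layer definition might look like this (the convolution shapes are only illustrative):

import torch.nn as nn

layer = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),   # overwrite the convolution output instead of allocating a new tensor
)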
12、Gradient accumulation
First, some basic PyTorch background:
In PyTorch, when we call loss.backward(), the gradient is computed for each parameter and stored in parameter.grad. Note that parameter.grad is a tensor, and it accumulates the gradients across calls.
In PyTorch, the network parameters are only updated by gradient descent when optimizer.step() is called.
We know that batch size is closely tied to video memory usage, but sometimes the batch size should not be made too small. What can we do then?
The answer is gradient accumulation.
Let's look at a traditional training loop first:
for i, (feature, target) in enumerate(train_loader):
    outputs = model(feature)            # forward pass
    loss = criterion(outputs, target)   # compute the loss
    optimizer.zero_grad()               # clear the gradients
    loss.backward()                     # backward pass, compute the gradients
    optimizer.step()                    # update the network parameters
After adding gradient accumulation, the code looks like this:
for i, (features, target) in enumerate(train_loader):
    outputs = model(features)                # forward pass
    loss = criterion(outputs, target)        # compute the loss
    loss = loss / accumulation_steps         # optional: average the loss over the accumulated batches
    loss.backward()                          # backward pass, compute the gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                     # update the network parameters
        optimizer.zero_grad()                # clear the gradients
Comparing the two, we see that gradient accumulation essentially accumulates the gradients of accumulation_steps batches (each with its loss divided by accumulation_steps) and then updates the network parameters with the accumulated gradient, achieving an effect similar to a larger real batch_size. When using it, remember to scale up the learning rate appropriately.
More concretely, suppose batch_size = 4 and accumulation_steps = 8. Gradient accumulation first runs the forward and backward passes with batch_size = 4 to compute gradients, but does not update the parameters; it keeps accumulating gradients until accumulation_steps batches have been processed, and only then updates the parameters. This is essentially equivalent to:
real batch_size = batch_size * accumulation_steps
Gradient accumulation can greatly alleviate the problem of insufficient GPU memory and is highly recommended.
Taking PyTorch as an example, a typical neural network training loop looks like this:
for i, (inputs, labels) in enumerate(trainloader):
    optimizer.zero_grad()                  # clear the gradients
    outputs = net(inputs)                  # forward pass
    loss = criterion(outputs, labels)      # compute the loss
    loss.backward()                        # backward pass, compute the gradients
    optimizer.step()                       # update the parameters
    if (i + 1) % evaluation_steps == 0:
        evaluate_model()
From the code we can clearly see how the network is trained:
1. Clear the gradients computed for the previous batch
2. Forward pass: feed the data through the network and get the predictions
3. Compute the loss from the predictions and the labels
4. Back-propagate the loss to compute the parameter gradients
5. Update the network parameters with the computed gradients
Now let's see how gradient accumulation is done:
for i, (inputs, labels) in enumerate(trainloader):
    outputs = net(inputs)                  # forward pass
    loss = criterion(outputs, labels)      # compute the loss
    loss = loss / accumulation_steps       # normalise the loss
    loss.backward()                        # backward pass, compute the gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                   # update the parameters
        optimizer.zero_grad()              # clear the gradients
    if (i + 1) % evaluation_steps == 0:
        evaluate_model()
1. Forward pass: feed the data through the network and get the predictions
2. Compute the loss from the predictions and the labels
3. Back-propagate the loss to compute the parameter gradients
4. Repeat steps 1-3 without clearing the gradients, accumulating them instead
5. Once the gradients have been accumulated a fixed number of times, update the parameters and then clear the gradients
To sum up, gradient accumulation computes the gradient for each batch without resetting it, accumulating the gradients instead; after a certain number of accumulations it updates the network parameters and only then clears the gradients.
By delaying the parameter update in this way, we can achieve an effect similar to using a larger batch size. In ordinary experiments I usually use gradient accumulation, and most of the time the model trained with gradient accumulation performs noticeably better than the one trained with the smaller batch size.
The BERT repository uses exactly this trick. It is very practical, a real godsend for those of us short on compute.
Gradient checkpointing
I have never used this trick myself; after all, my models are not that big. For reference, a sketch is shown below.
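For completeness, a minimal sketch using torch.utils.checkpoint (the module structure and sizes are made up for illustration): checkpointed blocks do not store their intermediate activations during the forward pass and recompute them during the backward pass, trading extra compute for lower memory.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.head = nn.Linear(512, 10)

    def forward(self, x):
        # activations inside the checkpointed blocks are recomputed in the backward pass
        x = checkpoint(self.block1, x)
        x = checkpoint(self.block2, x)
        return self.head(x)

x = torch.randn(8, 512, requires_grad=True)   # input must require grad for gradients to flow here
Net()(x).sum().backward()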
Note: increasing the (effective) batch size reduces the number of optimization steps per epoch, so convergence may slow down and it can take longer to converge (for example when batch_size becomes the total number of samples).