[GPU Memory Optimization] GPU Memory Optimization Methods for Deep Learning
2022-07-01 15:35:00 【Dudu is too delicious】
GPU memory is critical in deep learning: if there is not enough of it, the model cannot run at all. This article introduces several optimization methods to use when GPU memory is insufficient; they can reduce the memory requirements of a deep learning model.
I. Gradient Accumulation
Gradient accumulation means that during model training, after computing the gradient for one batch of data, the model parameters are not updated immediately with that gradient. Instead, training continues with the next batch and its gradient is added to the previous ones. Only after a set number of iterations are the parameters updated with the accumulated gradient. This effectively enlarges the batch size without increasing memory usage.
from torch.nn import MSELoss
from torch.optim import SGD

# SimpleNet, epochs, and loader are assumed to be defined elsewhere.
model = SimpleNet()
mse = MSELoss()
optimizer = SGD(params=model.parameters(), lr=0.1, momentum=0.9)

accumulate_batches_num = 10  # accumulate gradients over 10 batches

for epoch in range(epochs):
    for i, (data, label) in enumerate(loader):
        output = model(data)
        loss = mse(output, label)
        # Gradients accumulate in the parameters' .grad buffers across iterations.
        loss.backward()

        # Update the model parameters once every accumulate_batches_num batches.
        if (i + 1) % accumulate_batches_num == 0:
            optimizer.step()
            optimizer.zero_grad()
II. Mixed Precision Training
Reference: mixed precision training
Floating-point data types mainly include double precision (FP64), single precision (FP32), and half precision (FP16). As shown in the figure, half precision (FP16) is a relatively new floating-point type that uses 2 bytes (16 bits) of storage. In the IEEE 754-2008 standard it is also called binary16. Compared with the commonly used single precision (FP32) and double precision (FP64) types, FP16 is better suited to scenarios with lower accuracy requirements.

With the same hyperparameters, mixed precision training with half precision (FP16) and single precision (FP32) floating point can reach the same accuracy as pure single precision training, while also speeding up model training. This is mainly thanks to the Tensor Core technology that NVIDIA introduced starting with the Volta architecture. Computing with FP16 has the following characteristics:
- FP16 halves memory bandwidth and storage requirements, allowing researchers to use larger and more complex models and larger batch sizes.
- FP16 can take full advantage of the Tensor Cores provided by NVIDIA Volta, Turing, and Ampere architecture GPUs. On the same GPU hardware, the FP16 compute throughput of Tensor Cores is 8 times that of FP32.
However, using FP16 also has the following drawbacks (a small demonstration of both follows the list):
- Data overflow: the representable range of FP16 is much narrower than that of FP32, so replacing FP32 with FP16 can cause overflow and underflow. In deep learning, the gradients of the network weights (first derivatives) must be computed, and gradients are usually smaller than the weights themselves, so underflow is especially common.
- Rounding error: a rounding error occurs when a backward gradient is very small. FP32 can usually represent it, but after conversion to FP16 the value falls between representable FP16 numbers and is forcibly rounded. For example, 0.00006666666 can be represented normally in FP32, but converted to FP16 it becomes roughly 0.000067, because it does not hit an exactly representable FP16 value.
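To make these two failure modes concrete, here is a minimal PyTorch sketch (the specific values are illustrative):

```python
import torch

# Underflow: a gradient-sized value that FP32 holds fine becomes 0 in FP16
# (FP16 cannot represent anything smaller than about 6e-8).
small = torch.tensor(1e-8, dtype=torch.float32)
print(small.half())   # tensor(0., dtype=torch.float16)

# Overflow: anything beyond FP16's maximum (~65504) becomes inf.
big = torch.tensor(70000.0, dtype=torch.float32)
print(big.half())     # tensor(inf, dtype=torch.float16)

# Rounding error: 0.00006666666 is not exactly representable in FP16,
# so it is rounded to the nearest FP16 value (about 6.66e-05).
x = torch.tensor(0.00006666666, dtype=torch.float32)
print(x.half())
```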
To keep the benefits of FP16 for deep learning training while avoiding precision overflow and rounding error, FP16 and FP32 can be combined in mixed precision training (Mixed Precision). Mixed precision training introduces three related techniques: weight backup (Weight Backup), loss scaling (Loss Scaling), and precision accumulation (Precision Accumulation).
1. Weight Backup
Weight backup is mainly used to solve the rounding error problem. The main idea is to store the activations, gradients, and other intermediate variables produced during training in FP16, while keeping an additional FP32 copy of the weight parameters for the training update. As shown in the figure below, FP16 is used for the forward and backward computation, but the parameter update uses FP32.

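A minimal sketch of the weight-backup idea (the function and the commented setup are illustrative, not a standard API):

```python
import torch

def sgd_step_with_master_weights(model, master_params, lr=0.1):
    """One SGD step that keeps an FP32 master copy of FP16 model weights."""
    with torch.no_grad():
        for p16, p32 in zip(model.parameters(), master_params):
            if p16.grad is None:
                continue
            # Apply the (upcast) FP16 gradient to the FP32 master weight...
            p32 -= lr * p16.grad.float()
            # ...then copy the updated FP32 weight back into the FP16 model.
            p16.copy_(p32.half())

# Setup (assumed model): FP16 copy for forward/backward, FP32 master for updates.
# model = Net().cuda().half()
# master_params = [p.detach().clone().float() for p in model.parameters()]
```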
2. Loss Scaling
During back propagation, gradient values are usually very small. Training works normally in FP32, but in FP16 the limited representable range (everything to the left of the red line in the figure below becomes 0 in FP16) turns the smaller gradients into 0, so the parameters cannot be optimized and the model does not converge.

To solve the underflow caused by very small gradients, the loss is scaled up, and the gradients are scaled back down before the parameters are optimized. The specific steps are:
① Forward pass: loss = loss * s
② Backward pass: grad = grad / s
Multiplying the loss by a factor s scales all gradients up by the same factor; scaling them back by 1/s before the parameter update then solves the gradient underflow problem.
The PyTorch sample code is as follows (adapted from the PyTorch automatic mixed precision examples):
from torch import optim
from torch.cuda.amp import autocast, GradScaler

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()

3. Precision Accumulation
During mixed precision training, matrix multiplications are performed in FP16, while the intermediate results of the multiplication are accumulated in FP32 and only then converted back to FP16 for storage. In short, FP16 is used for the matrix multiply, and FP32 is used to compensate for the precision that would otherwise be lost. This effectively reduces rounding error during computation and minimizes the loss of accuracy.
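The effect of the accumulator's precision can be seen with a simple sum (a sketch of the principle only, not of how Tensor Cores are actually programmed):

```python
import torch

# 10,000 additions of 0.01: the true total is 100.
values = torch.full((10000,), 0.01, dtype=torch.float16)

# Accumulating in FP16: each partial sum is rounded to FP16, and once the sum
# grows large enough the small additions are rounded away entirely.
fp16_sum = torch.tensor(0.0, dtype=torch.float16)
for v in values:
    fp16_sum = fp16_sum + v

# Accumulating in FP32, then storing the result back in FP16.
fp32_sum = values.float().sum().half()

print(fp16_sum)  # far below 100 due to accumulated rounding error
print(fp32_sum)  # approximately 100
```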
III. Recomputation
Generally speaking, a training of deep learning network consists of three parts :
- Forward computation (forward): in this stage the model's operators are evaluated in order; each operator takes its input, computes its output, and passes it on as the input of the next layer, until the final result (usually the loss) is obtained.
- Backward computation (backward): in this stage the gradient of each layer's parameters is computed by backward differentiation and the chain rule.
- Gradient update (optimization): in this stage the parameters are updated with the gradients obtained in the backward pass; this is also called learning or parameter optimization.
During chained back propagation, the outputs of the intermediate layers are needed to compute the parameter gradients, so these outputs are normally kept in memory during training. To reduce memory consumption, the intermediate layer outputs can be discarded instead; when the gradients are needed in the backward pass, the required outputs are recomputed with a local forward pass.
Recomputation therefore trades extra computation time for reduced memory usage.
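PyTorch exposes this idea through torch.utils.checkpoint; the following is a minimal sketch in which the toy model and layer sizes are purely illustrative:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy deep MLP; only its structure matters for the example.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(16)])

x = torch.randn(32, 1024, requires_grad=True)

# Split the network into 4 segments: only the activations at segment
# boundaries are kept during the forward pass; the rest are recomputed
# during the backward pass, reducing peak memory at the cost of extra time.
out = checkpoint_sequential(model, 4, x)
loss = out.sum()
loss.backward()
```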