
[GPU Memory Optimization] Memory Optimization Methods for Deep Learning


GPU memory is critical for deep learning: if there is not enough of it, the model cannot run at all. This article introduces several optimization methods to use when GPU memory is insufficient; they reduce the memory requirements of a deep learning model.

Contents

I. Gradient Accumulation

II. Mixed Precision

  1. Weight Backup

  2. Loss Scaling

  3. Precision Accumulation

III. Recomputation


I. Gradient Accumulation

Gradient accumulation means that during training, after computing the gradient for one batch of data, we do not immediately update the model parameters with that gradient. Instead, we continue with the next batch, accumulate its gradient, and repeat; after a fixed number of batches, the accumulated gradient is used to update the parameters once. This effectively enlarges the batch_size without increasing memory use.

from torch.nn import MSELoss
from torch.optim import SGD

model = SimpleNet()   # SimpleNet, epochs and loader are assumed to be defined by the user
mse = MSELoss()
optimizer = SGD(params=model.parameters(), lr=0.1, momentum=0.9)

accumulate_batchs_num = 10  # accumulate gradients over 10 batches

for epoch in range(epochs):
    for i, (data, label) in enumerate(loader):

        output = model(data)
        loss = mse(output, label)

        # backward() adds this batch's gradient to the gradients already
        # stored in the parameters, because they have not been cleared yet
        loss.backward()

        # once accumulate_batchs_num batches have been accumulated, update the parameters
        if (i + 1) % accumulate_batchs_num == 0:
            optimizer.step()
            # clear the accumulated gradients for the next round
            optimizer.zero_grad()
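
For example, with a per-batch size of 32 and accumulate_batchs_num = 10, each optimizer step uses gradients accumulated from 320 samples, so the effective batch size is 320 while the peak memory footprint stays that of a 32-sample batch. Since backward() sums gradients, the loss (or the accumulated gradient) is usually also divided by accumulate_batchs_num so that the update magnitude matches a genuine large-batch step.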

II. Mixed Precision

Reference: Mixed Precision Training

Floating-point data types are mainly divided into double precision (FP64), single precision (FP32), and half precision (FP16). As shown in the figure, half precision (FP16) is a relatively new floating-point type that uses 2 bytes (16 bits) of storage. In the IEEE 754-2008 standard it is also called binary16. Compared with the single-precision (FP32) and double-precision (FP64) types commonly used in computation, FP16 is better suited to scenarios with lower precision requirements.

With the same hyperparameters, mixed-precision training that combines half-precision (FP16) and single-precision (FP32) floating point can reach the same accuracy as pure single-precision training, while also speeding up training. This is mainly thanks to the Tensor Core technology that NVIDIA introduced starting with the Volta architecture. Computing with FP16 has the following characteristics:

  • FP16 halves the memory bandwidth and storage requirements, which allows researchers to use larger and more complex models as well as larger batch sizes.

  • FP16 can take full advantage of the Tensor Cores provided by NVIDIA's Volta, Turing, and Ampere architecture GPUs. On the same GPU hardware, the FP16 compute throughput of Tensor Cores is 8 times that of FP32.

However, using FP16 also brings the following drawbacks:

  1. Data overflow: FP16 has a much narrower representable range than FP32, so replacing FP32 with FP16 can cause both overflow and underflow. In deep learning we also need the gradients of the network weights (first derivatives), and gradients are usually even smaller than the weights themselves, so underflow is especially common.
  2. Rounding error: a rounding error occurs when a value, typically a very small backward gradient, can still be represented in FP32 but falls between two adjacent representable FP16 values, whose spacing is much coarser; the value is then forcibly rounded to the nearest one. For example, 0.00006666666 is represented normally in FP32, but converted to FP16 it is rounded onto FP16's coarser grid and the extra digits are lost. Both effects are illustrated in the short demo after this list.
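
To make the two failure modes concrete, here is a minimal NumPy sketch (not from the original article; the sample values are arbitrary, chosen around FP16's limits of roughly 6e-8 for the smallest subnormal and 65504 for the largest value):

import numpy as np

# Underflow: a gradient-sized value that FP32 handles fine collapses to 0 in FP16.
g = 1e-8
print(np.float32(g))   # 1e-08
print(np.float16(g))   # 0.0  -> the gradient silently vanishes

# Rounding error: FP16 keeps only about 3-4 significant decimal digits,
# so nearby values are rounded onto its coarse grid.
x = 0.00006666666
print(np.float32(x))   # 6.666666e-05
print(np.float16(x))   # 6.664e-05 (nearest representable FP16 value)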

To keep the benefits of FP16 for deep learning training while avoiding precision overflow and rounding errors, FP16 and FP32 can be combined in mixed-precision training (Mixed Precision). Three related techniques are usually introduced along the way: weight backup (Weight Backup), loss scaling (Loss Scaling), and precision accumulation (Precision Accumulation).

1. Weight Backup

Weight backup mainly addresses the rounding-error problem. The idea is to store the data produced during training, such as activations, gradients, and intermediate variables, in FP16, while keeping an extra FP32 copy of the weights for the parameter update. As shown in the figure below, the forward and backward computations use FP16, but the parameters are updated in FP32.
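
Below is a minimal PyTorch-style sketch of this idea (an illustration only, not the article's code; model, loss_fn, data, and target are assumed to exist): FP32 master weights are kept on the side, the forward and backward passes run in FP16, and the FP16 gradients are applied to the FP32 copy before being cast back.

import torch

# model: any nn.Module; converted to FP16 for forward/backward.
model = model.half().cuda()

# FP32 master copy of every parameter, used only for the optimizer update.
master_params = [p.detach().clone().float() for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=0.1)

def train_step(data, target, loss_fn):
    # Forward and backward run in FP16 (activations and gradients are FP16).
    output = model(data.half().cuda())
    loss = loss_fn(output, target.cuda())
    loss.backward()

    with torch.no_grad():
        # Copy the FP16 gradients onto the FP32 master weights ...
        for master, p in zip(master_params, model.parameters()):
            master.grad = p.grad.detach().float()
            p.grad = None

        # ... update in FP32, then write the result back into the FP16 model.
        optimizer.step()
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master.half())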

2. Loss Scaling

During back-propagation the gradient values are usually very small. With FP32 training proceeds normally, but with FP16 the limited representable range (everything to the left of the red line in the figure below becomes 0 in FP16) causes the smaller gradients to be flushed to 0, so the corresponding parameters are never optimized and the model fails to converge.

To solve the data underflow caused by gradients that are too small, the loss is scaled up, and the gradients are scaled back down before the parameters are updated. Concretely:

① Forward pass: loss = loss * s

② Before the parameter update: grad = grad / s

Multiplying the loss by a factor s makes every gradient grow by the same factor during back-propagation, which keeps them inside FP16's representable range; scaling them back by 1/s before the optimizer step restores the original magnitudes, so the gradient underflow problem is solved.

The PyTorch sample code is as follows (see the PyTorch automatic mixed precision reference):

import torch
from torch import optim
from torch.cuda.amp import autocast, GradScaler

# Net, epochs, data and loss_fn are assumed to be defined by the user.

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in range(epochs):
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()

        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)

        # Updates the scale for next iteration.
        scaler.update()
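
Note that GradScaler manages the scale factor s dynamically: if the unscaled gradients contain infs or NaNs, the optimizer step is skipped and the scale is reduced; after a run of successful steps the scale is increased again (by default it is halved on failure and doubled every 2000 successful iterations), so no manual tuning of s is needed.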

3. Precision Accumulation

In mixed-precision training, FP16 is used for the matrix multiplications, while FP32 is used to accumulate the intermediate products of those multiplications; the FP32 result is then converted back to FP16 for storage. In short, FP16 does the multiplications and FP32 makes up for the precision lost along the way. This effectively reduces rounding error during computation and minimizes the loss of accuracy, as the short demo below illustrates.
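
A small illustration of why the accumulator's precision matters (not from the original article; the numbers are chosen to hit FP16's limits): once an FP16 running sum reaches 2048, adding 1.0 no longer changes it, because adjacent FP16 values there are 2 apart.

import numpy as np

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(4096):
    acc16 += np.float16(1.0)   # FP16 accumulator
    acc32 += np.float16(1.0)   # same FP16 addends, FP32 accumulator
print(acc16)   # 2048.0 -> the FP16 sum stalls
print(acc32)   # 4096.0 -> correct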

III. Recomputation

Generally speaking, one training iteration of a deep learning network consists of three parts:

Forward computation (forward): each operator of the model is evaluated in order; an operator's input produces its output, which is passed to the next layer as input, until the final layer's result (usually the loss) is obtained.
Backward computation (backward): the gradient of each layer's parameters is computed by reverse differentiation and the chain rule.
Gradient update (optimization): the parameters are updated with the gradients obtained in the backward pass; this is also called learning or parameter optimization.
Because the chain rule in back-propagation needs the outputs of the intermediate layers to compute the parameter gradients, these outputs are normally kept in memory during training. To reduce memory consumption, we can choose not to store the intermediate layers' results; when the gradients are needed in the backward pass, the intermediate outputs are recomputed by a local forward pass, and the gradients are then computed from them.

Recomputation trades extra computation time for memory space.
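
In PyTorch, recomputation is available as gradient checkpointing through torch.utils.checkpoint. A minimal sketch (the block sizes and layer shapes are arbitrary examples):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Illustrative sub-networks; in practice these would be large blocks
# whose intermediate activations are expensive to keep in memory.
block1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
block2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
head = nn.Linear(1024, 10)

x = torch.randn(32, 1024, requires_grad=True)

# The activations inside block1/block2 are not stored during the forward pass;
# they are recomputed by a local forward pass when backward() needs them.
h = checkpoint(block1, x, use_reentrant=False)
h = checkpoint(block2, h, use_reentrant=False)
out = head(h)
out.sum().backward()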
