[PyTorch notes] Automatic mixed precision training with torch.cuda.amp
2022-07-01 19:19:00 【magic_ ll】
Nvidia introduced Tensor Core units in the Volta architecture to support mixed-precision computation with FP32 and FP16. A Tensor Core is a matrix multiply-accumulate unit; each Tensor Core performs 64 floating-point mixed-precision operations per clock (FP16 matrix multiplication with FP32 accumulation).
In the same year, the PyTorch extension apex was released to support automatic mixed precision training of model parameters.
Since PyTorch 1.6, amp is supported natively as torch.cuda.amp, contributed to PyTorch by NVIDIA developers. Only CUDA hardware with Tensor Cores gets the full benefit of amp.
1 FP16 half precision
FP16 and FP32 are binary floating-point data types used by computers.
FP16 is half precision and occupies 2 bytes; FP32 is the ordinary float and occupies 4 bytes.
Here sign gives the sign of the number, and the exponent field encodes a power of two, $2^{(n-15+1)}$; the details are not covered here and can be looked up when needed.
In-memory layout of the float types
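To get a feel for the two formats, numpy's finfo reports their limits (a small illustration; the exact printed values depend on the numpy version):

```python
import numpy as np

# Compare the numeric limits of FP16 and FP32
for dtype in (np.float16, np.float32):
    info = np.finfo(dtype)
    print(dtype.__name__, "| bits:", info.bits, "| max:", info.max,
          "| smallest normal:", info.tiny, "| eps:", info.eps)

# float16: max ≈ 65504,   smallest positive normal ≈ 6.1e-05, eps ≈ 9.8e-04
# float32: max ≈ 3.4e+38, smallest positive normal ≈ 1.2e-38, eps ≈ 1.2e-07
```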
Using FP16 alone:

- Advantages:
  - Lower video memory usage, which allows larger batch sizes, larger models, and larger input sizes; occasionally it even improves accuracy
  - Faster training and inference computation, up to roughly double the speed
- Disadvantages:
  - Overflow and underflow: because the dynamic range of FP16 is much smaller than that of FP32, overflow and underflow occur easily during computation and then produce NaN values. In deep learning, activation gradients are usually smaller than weight gradients, so underflow is the more common case. Once the gradients of layer L underflow, the weights of layer L-1 and all earlier layers can no longer be updated. (A small numeric demonstration follows the rounding-error example below.)
  - Rounding error: when a gradient is too small, smaller than the minimum representable step at the current magnitude, the gradient update is lost. For example, in FP16, $2^{-3} + 2^{-14} = 2^{-3}$: a rounding error occurs because within $[2^{-3}, 2^{-2}]$ the next representable number after $2^{-3}$ is $2^{-3} + 2^{-13}$.
```python
import numpy as np

a = np.array(2 ** (-3), dtype=np.float16)
b = np.array(2 ** (-14), dtype=np.float16)
c = a + b
print(a)          # 0.125
print('%f' % b)   # 0.000061
print(c)          # 0.125  -- b has been rounded away
```
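The overflow and underflow problems mentioned above can be reproduced the same way (a small illustration; depending on the numpy version these lines may also emit RuntimeWarnings):

```python
import numpy as np

# Overflow: the largest FP16 value is about 65504, anything larger becomes inf
big = np.float16(100000.0)
print(big)          # inf
print(big - big)    # inf - inf = nan

# Underflow: values below the smallest FP16 subnormal (about 6e-08) flush to zero
print(np.float16(1e-8))  # 0.0
```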
PyTorch data types: in PyTorch there are 10 tensor types:

- torch.FloatTensor – 32-bit floating point (the default tensor type created by PyTorch)
- torch.DoubleTensor – 64-bit floating point
- torch.HalfTensor – 16-bit floating point (1)
- torch.BFloat16Tensor – 16-bit floating point (2)
- torch.ByteTensor – 8-bit integer (unsigned)
- torch.CharTensor – 8-bit integer (signed)
- torch.ShortTensor – 16-bit integer (signed)
- torch.IntTensor – 32-bit integer (signed)
- torch.LongTensor – 64-bit integer (signed)
- torch.BoolTensor – Boolean

```python
import torch

tensor = torch.zeros(20, 20)
print(tensor.type())  # torch.FloatTensor
```
2 Mixed precision training mechanism
Automatic Mixed Precision (AMP) training trains a model whose numerical precision is FP32 while running some operators in FP16 and the rest in FP32. Which operators use which precision is decided automatically by amp; no extra user configuration is needed.
Without changing the model or reducing training accuracy, this shortens training time and reduces memory requirements, which in turn allows larger batch sizes, larger models, and larger input sizes.
torch.cuda.amp provides a very convenient mixed precision training mechanism, implemented with amp.autocast and amp.GradScaler:

- The user does not need to adjust the dtype of model parameters manually; amp automatically chooses the appropriate numerical precision for each operator.
- For the FP16 gradient overflow problem in back propagation, amp provides gradient scaling, and before the optimizer updates the parameters it automatically unscales the gradients, so it has no effect on the hyperparameters of model optimization.
The overall flow is as follows. Ordinary neural network training: forward pass to compute the loss, backward pass to compute gradients, gradient update.
Mixed precision training: copy the weights and convert the copy to an FP16 model, forward pass to compute the loss, scale the loss up, backward pass to compute gradients, scale the gradients back down, and update the FP32 model with the FP16 gradients. The concrete amp training procedure:
- Maintain a copy of the model in FP32 precision
- At each iteration:
  - Copy the model and convert the copy to FP16.
  - Forward pass (with the FP16 model parameters): for FP16 operators, compute directly; for FP32 operators, the inputs and outputs are FP16 but the computation runs in FP32. The same applies to the backward pass.
  - Scale the loss by a factor s
  - Backward pass, i.e. gradient computation (FP16 model parameters and parameter gradients)
  - Multiply the gradients by 1/s
  - Update the FP32 model parameters with the FP16 gradients
As for the choice of the scaling factor s, a constant is not appropriate: the loss and the gradient values keep changing, so s has to change dynamically with the loss.
A healthy loss decreases while oscillating, so GradScaler is designed to multiply s by a factor greater than 1 every N iterations when scaling the loss. The full procedure with this dynamic scale becomes:
- Maintain a copy of the model in FP32 precision
- At each iteration:
  - Copy the model and convert the copy to FP16.
  - Forward pass (with the FP16 model parameters)
  - Scale the loss by a factor s
  - Backward pass, i.e. gradient computation (FP16 model parameters and parameter gradients)
  - Check whether any parameter gradient contains inf or nan. If so, reduce s and go back to step 1, skipping this update
  - Multiply the gradients by 1/s
  - Update the FP32 model parameters with the FP16 gradients
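For illustration only, the dynamic loss-scaling loop above can be written out by hand roughly as follows (a simplified sketch of the idea, not what torch.cuda.amp actually does internally; it assumes model, optimizer, loss_fn and data exist as in the snippets below, and the FP32 master copy / FP16 model copy bookkeeping is omitted; grads_have_inf_or_nan is a hypothetical helper):

```python
import torch

# illustrative scaling hyperparameters, mirroring GradScaler's defaults
scale, growth_factor, backoff_factor, growth_interval = 2.0 ** 16, 2.0, 0.5, 2000
good_steps = 0

def grads_have_inf_or_nan(parameters):
    # hypothetical helper: True if any gradient contains inf or nan
    return any(p.grad is not None and not torch.isfinite(p.grad).all()
               for p in parameters)

for input, target in data:
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)   # forward pass
    (loss * scale).backward()              # scale the loss, so the gradients are scaled too

    if grads_have_inf_or_nan(model.parameters()):
        scale *= backoff_factor            # overflow detected: shrink s and skip this step
        good_steps = 0
        continue

    for p in model.parameters():           # multiply the gradients by 1/s
        if p.grad is not None:
            p.grad.mul_(1.0 / scale)
    optimizer.step()                       # update the (FP32) parameters

    good_steps += 1
    if good_steps % growth_interval == 0:  # stable for a while: grow s again
        scale *= growth_factor
```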
The basic usage of mixed precision training is as follows:
```python
from torch.cuda.amp import autocast, GradScaler

# amp relies on Tensor Cores, so the model parameters must be CUDA tensors
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# the GradScaler object handles gradient scaling automatically
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # run the forward pass inside the autocast-enabled region
        with autocast():
            # the model runs an FP16 forward pass where appropriate
            output = model(input)
            loss = loss_fn(output, target)

        # scale the loss (FP16) with the scaler; backward produces scaled FP16 gradients
        scaler.scale(loss).backward()

        # scaler.step updates the parameters, automatically unscaling the gradients first;
        # if nan or inf gradients are found, the step is skipped automatically
        scaler.step(optimizer)

        # update the scale factor
        scaler.update()
```

The details are explained below.
3 autocast
class autocast(device_type, enabled=True, **kwargs)
- [device_type] (string): whether to use the 'cuda' or 'cpu' device
- [enabled] (bool, default True): whether autocasting (automatic type conversion) is enabled in the region
- [dtype] (torch.dtype): whether to use torch.float16 or torch.bfloat16
- [cache_enabled] (bool, default True): whether the weight cache inside autocast is enabled
Notes:
- autocast instances can be used as context managers or decorators, marking a region of code to run in mixed precision
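A minimal usage sketch (torch.cuda.amp.autocast() is the CUDA shorthand; the generic torch.autocast entry point with explicit device_type/dtype arguments is available in newer PyTorch versions):

```python
import torch

net = torch.nn.Linear(8, 8).cuda()
x = torch.rand(4, 8, device="cuda")

# as a context manager, with FP16 autocasting on CUDA
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = net(x)

print(y.dtype)           # torch.float16 (linear layers are on the FP16 list)
print(net.weight.dtype)  # torch.float32 (the parameters themselves are untouched)
```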
3.1 autocast operators
In PyTorch, inside an autocast region some operators are automatically cast to FP16 for computation. Only CUDA operators are eligible for autocasting.
- Operators that amp automatically casts to FP16:
- Operators automatically cast to FP32:
- Some operators are not in either list, e.g. dot, add, cat, ...; they run at the widest numerical precision among their inputs: if any input is FP32 they compute in FP32, and if all inputs are FP16 they compute in FP16.
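A quick way to see which precision an operator actually ran in is to check the dtype of its output (a small sketch; per the official autocast op lists, mm is on the FP16 list, softmax is on the FP32 list, and cat promotes to the widest input type):

```python
import torch

a = torch.rand(8, 8, device="cuda")
b = torch.rand(8, 8, device="cuda")

with torch.cuda.amp.autocast():
    mm_out = torch.mm(a, b)                    # FP16 list -> computed and returned in FP16
    sm_out = torch.softmax(mm_out, dim=1)      # FP32 list -> computed and returned in FP32
    cat_out = torch.cat([mm_out, sm_out], 0)   # promotes to the widest input dtype

print(mm_out.dtype)   # torch.float16
print(sm_out.dtype)   # torch.float32
print(cat_out.dtype)  # torch.float32
```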
3.2 Explicit precision conversion
When entering an autocast-enabled region, tensors may be of any type; when using autocasting, you should not call half() or bfloat16() on the model or the inputs.
However, when autocast is used as a context manager, FP16 variables produced inside the enabled region must be explicitly converted back to FP32 once you leave the region; otherwise type-mismatch errors may occur later on.

```python
# Create some tensors in the default dtype (assumed here to be FP32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():
    # torch.mm is on the autocast FP16 operator list, so it is cast to FP16.
    # The inputs are FP32, the operation runs in FP16, and the output is FP16.
    # No manual casting is needed for this step.
    e_float16 = torch.mm(a_float32, b_float32)
    # mixed input types are also allowed
    f_float16 = torch.mm(d_float32, e_float16)

# After exiting autocast, FP16 variables produced inside the region
# must be explicitly converted back to FP32.
g_float32 = torch.mm(d_float32, f_float16.float())
```
autocast regions can also be nested:

```python
# Create some tensors in the default dtype (assumed here to be FP32)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():
    e_float16 = torch.mm(a_float32, b_float32)
    with autocast(enabled=False):
        # inside the disabled region, FP16 values must be cast back to FP32
        f_float32 = torch.mm(c_float32, e_float16.float())
    g_float16 = torch.mm(d_float32, f_float32)
```

3.3 autocast as a decorator
This usage mainly matters for distributed training. autocast is designed to be thread-local, so enabling an autocast region only in the main thread does not work for the other threads.
The usual call pattern in non-distributed training is:
```python
model = MyModel()
with autocast():
    output = model(input)
```
Distributed training uses nn.DataParallel or nn.DistributedDataParallel. Adding the autocast region after creating the model, as below, is not effective, because here autocast only works in the main thread:

```python
model = MyModel()
DP_model = nn.DataParallel(model)  # added for data parallelism

with autocast():
    output = DP_model(input)  # autocast does not apply inside the worker threads
```
For autocast to take effect in the other threads as well, autocast must also be set inside forward. There are two ways: add a decorator, or add a context manager.

```python
# Option 1: decorator
class MyModel(nn.Module):
    @autocast()
    def forward(self, input):
        ...

# Option 2: context manager
class MyModel(nn.Module):
    def forward(self, input):
        with autocast():
            ...

# In the main function
model = MyModel()
DP_model = nn.DataParallel(model)

with autocast():
    output = DP_model(input)
```
4 The GradScaler class
Mixed precision training sometimes fails to converge because the activation gradient values are too small and underflow. torch.cuda.amp.GradScaler prevents gradient underflow by scaling up the loss value.

torch.cuda.amp.GradScaler(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, enabled=True)
- 【init_scale】 initial value of the scale factor
- 【growth_factor】 factor by which the scale factor grows
- 【backoff_factor】 factor by which the scale factor is reduced
- 【growth_interval】 number of iterations between scale-factor growths
- 【enabled】 whether scaling is performed
4.1 GradScaler methods

- scale(outputs) method: multiplies outputs by the scale factor and returns the result. If enabled=False, outputs are returned unchanged.
- step(optimizer, *args, **kwargs) method: does two things: unscales the gradients, then checks for gradient overflow; if there is no nan/inf it calls optimizer.step(), otherwise the step is skipped.
- update(new_scale=None) method: must be called before the end of each iteration. If the parameter update was skipped, the scale factor is multiplied by backoff_factor; when the growth interval is reached, the scale factor is multiplied by growth_factor. new_scale can also be used to set the scale factor directly.
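The scale-factor bookkeeping described above can be mimicked with a tiny toy class (illustrative only; it is not the real torch.cuda.amp.GradScaler, which does this with CUDA tensors):

```python
class ToyScaler:
    """Toy mimic of GradScaler's scale-factor bookkeeping (illustration only)."""
    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf, new_scale=None):
        if new_scale is not None:
            self.scale = new_scale                 # set the scale factor directly
        elif found_inf:
            self.scale *= self.backoff_factor      # the step was skipped: back off
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                self.scale *= self.growth_factor   # stable long enough: grow
                self._good_steps = 0

scaler = ToyScaler(growth_interval=3)
for found_inf in [False, False, True, False, False, False]:
    scaler.update(found_inf)
print(scaler.scale)  # 65536.0 (halved once by the inf, then doubled after 3 clean steps)
```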
Example:

```python
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)
scaler = GradScaler()  # instantiate a GradScaler object before training

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():  # run the forward pass under autocast
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()  # scale up the loss before backward
        # scaler.step() first unscales the gradients; if they contain no inf or NaN,
        # optimizer.step() is called to update the weights; otherwise the step call
        # is ignored, so the weights are never corrupted.
        scaler.step(optimizer)
        scaler.update()  # decide whether the scale factor should grow or shrink
```

4.2 More uses of GradScaler in gradient handling
Gradient clipping
```python
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # unscale the gradients first so that the clip threshold applies to the true gradients
        scaler.unscale_(optimizer)

        # clip the gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # since unscale_() was called explicitly, scaler.step() does not unscale again;
        # it still skips the parameter update if nan/inf gradients are found
        scaler.step(optimizer)
        scaler.update()
```
Gradient accumulation

```python
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            # normalize the loss by the number of accumulation steps
            loss = loss / iters_to_accumulate

        # scale the (normalized) loss and backward
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired
            # (e.g., to allow clipping unscaled gradients)

            # step() and update() proceed as usual
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```
Gradient penalty

```python
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # To prevent underflow, compute the scaled gradients from the scaled loss
        # outside the autocast region
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)

        # unscale the gradients
        inv_scale = 1. / scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # inside an autocast region, add the gradient penalty term to the loss
        with autocast():
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm

        scaler.scale(loss).backward()

        # may unscale_ here if desired
        # (e.g., to allow clipping unscaled gradients)

        # step() and update() proceed as usual
        scaler.step(optimizer)
        scaler.update()
```

4.5 Multiple models
A single scaler can handle multiple models, but scale(loss) and step(optimizer) must be called separately for each.

```python
scaler = torch.cuda.amp.GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()
        with autocast():
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)

        # retain_graph is unrelated to amp; it is needed here because the two
        # backward() calls share parts of the graph.
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()

        # To inspect or modify the gradients owned by a particular optimizer,
        # you can unscale them explicitly.
        scaler.unscale_(optimizer0)

        scaler.step(optimizer0)
        scaler.step(optimizer1)
        scaler.update()
```
5 Notes on mixed precision

- Use amp on GPUs with a Tensor Core architecture whenever possible. On GPUs without Tensor Cores, amp still reduces video memory use significantly, but speed drops a lot: for example, on a Turing-architecture GTX 1660, using amp roughly doubles the computation time while the memory use falls to less than half.
- Constant range: to keep computations from overflowing or underflowing, first make sure that manually set constants such as epsilon and INF stay within the FP16 range.
- Dimensions should preferably be multiples of 8: performance is best when tensor dimensions are multiples of 8.
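For example, a typical epsilon of 1e-8 already underflows in FP16 (a small illustration; the 1e-8 value is just a common default, not something fixed by amp):

```python
import torch

# 1e-8 is below the smallest FP16 subnormal (~6e-8), so it becomes exactly 0,
# and an expression like x / (norm + eps) can still divide by zero in FP16.
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16)

# 1e-4 is still representable in FP16, so it is a safer choice for FP16 code paths
print(torch.tensor(1e-4, dtype=torch.float16))  # approximately 1e-4
```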