[PyTorch notes] Automatic mixed precision training with torch.cuda.amp
2022-07-01 19:19:00 【magic_ll】
NVIDIA introduced Tensor Core units starting with the Volta architecture to support mixed-precision computation with FP32 and FP16. A Tensor Core is a matrix multiply-accumulate unit; each Tensor Core performs 64 mixed-precision floating-point operations per clock (FP16 matrix multiplication with FP32 accumulation).
In the same year, NVIDIA released the PyTorch extension apex to support automatic mixed precision training of model parameters.
Since version 1.6, PyTorch supports amp natively as torch.cuda.amp, contributed by NVIDIA developers. Only CUDA hardware with Tensor Cores can fully enjoy the advantages of amp.
1 FP16 half precision
FP16 and FP32 are binary floating-point data types used by computers.
FP16 is half precision and occupies 2 bytes; FP32 is the ordinary float type and occupies 4 bytes.
In FP16, the sign bit gives the sign, and the exponent field encodes a power of two of the form $2^{(n-15+1)}$; the finer details are not expanded here and can be looked up when needed.
Representation of the float type in memory:
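To make these ranges concrete, here is a quick illustrative check (not part of the original article) that prints the numeric limits of FP16 and FP32 with numpy:

```python
import numpy as np

# Numeric limits of half and single precision
print(np.finfo(np.float16))  # max ≈ 65504, smallest normal ≈ 6.1e-05, eps ≈ 9.77e-04
print(np.finfo(np.float32))  # max ≈ 3.4e+38, smallest normal ≈ 1.2e-38, eps ≈ 1.2e-07
```

The much narrower FP16 range is what makes the overflow and rounding problems described below possible.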
Using FP16 alone:

- Advantages:
  - Lower GPU memory usage, which supports a larger batch size, a larger model, and larger input sizes; occasionally it even improves accuracy.
  - Faster training and inference; the speed can roughly double.
- Disadvantages:
  - Overflow problems: because the dynamic range of FP16 is much narrower than that of FP32, overflow and underflow easily occur during computation, producing NaN values. In deep learning, activation gradients are often smaller than weight gradients, so underflow is especially common. Once the gradients of layer L overflow or underflow, the weights of layer L-1 and all earlier layers can no longer be updated.
  - Rounding error: when a gradient update is too small, smaller than the minimum representable step in its current interval, the update is lost. For example, in FP16, $2^{-3} + 2^{-14} = 2^{-3}$; a rounding error occurs because within $[2^{-3}, 2^{-2}]$ the next representable number above $2^{-3}$ is $2^{-3} + 2^{-13}$.
```python
import numpy as np

a = np.array(2**(-3), dtype=np.float16)
b = np.array(2**(-14), dtype=np.float16)
c = a + b
print(a)          # 0.125
print('%f' % b)   # 0.000061
print(c)          # 0.125  (the addition of b is lost to rounding)
```
Data types in PyTorch:
PyTorch has 10 tensor types in total:

- torch.FloatTensor – 32-bit floating point (the default type of a tensor created by PyTorch)
- torch.DoubleTensor – 64-bit floating point
- torch.HalfTensor – 16-bit floating point (1)
- torch.BFloat16Tensor – 16-bit floating point (2)
- torch.ByteTensor – 8-bit integer (unsigned)
- torch.CharTensor – 8-bit integer (signed)
- torch.ShortTensor – 16-bit integer (signed)
- torch.IntTensor – 32-bit integer (signed)
- torch.LongTensor – 64-bit integer (signed)
- torch.BoolTensor – Boolean

```python
import torch

tensor = torch.zeros(20, 20)
print(tensor.type())   # torch.FloatTensor
```
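As a small illustrative addition (not from the original article), checking and converting a tensor's dtype looks like this:

```python
import torch

x = torch.zeros(3, 3)            # default: torch.float32 (torch.FloatTensor)
print(x.dtype, x.type())

x_half = x.half()                # convert to FP16 (torch.HalfTensor)
print(x_half.dtype)              # torch.float16

x_long = torch.zeros(3, 3, dtype=torch.int64)   # create with an explicit dtype
print(x_long.type())             # torch.LongTensor
```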
2 Mixed precision training mechanism
Automatic Mixed Precision (AMP) training trains a model whose numerical precision is FP32, while running some operators in FP16 and the rest in FP32. Which operators use which precision is decided automatically by amp; no extra user configuration is required.
Without changing the model or lowering training accuracy, this shortens training time and reduces memory consumption, which in turn supports a larger batch size, a larger model, and larger input sizes.
torch.cuda.amp provides a very convenient mixed precision training mechanism through amp.autocast and amp.GradScaler:

- Users do not need to manually adjust the dtype of model parameters; amp automatically selects the appropriate numerical precision for each operator.
- To handle numerical underflow of FP16 gradients during backpropagation, amp provides gradient scaling, and before the optimizer updates the parameters the gradients are automatically unscaled, so the optimizer's hyperparameters are not affected.
The overall procedure is as follows. Ordinary network training: forward pass to compute the loss, backward pass to compute the gradients, gradient update. Mixed precision training: copy the weights and convert the copy to an FP16 model, forward pass to compute the loss, scale the loss up, backward pass to compute the gradients, scale the gradients back down, and update the FP32 model with the FP16 gradients. The concrete amp training procedure:
- Maintain a copy of the model in FP32 numerical precision.
- At each iteration:
  - Copy the weights and convert the copy to an FP16 model.
  - Forward pass (with the FP16 model parameters). Operators on the FP16 list compute directly in FP16; for operators on the FP32 list, the inputs and outputs are FP16 but the computation itself runs in FP32. The same applies to the backward pass.
  - Scale the loss by a factor s.
  - Backward pass, i.e. compute the gradients (FP16 model parameters and parameter gradients).
  - Multiply the gradients by 1/s.
  - Update the FP32 model parameters with the FP16 gradients.
The scaling factor s cannot simply be a constant: the loss and gradient magnitudes change during training, so s must change dynamically with the loss. A healthy loss decreases while oscillating, so GradScaler is designed to multiply s by a factor greater than 1 every N iterations before scaling the loss, and to shrink s whenever overflow is detected (a hand-written sketch of this dynamic loop follows the step list below):
- Maintain a copy of the model in FP32 numerical precision.
- At each iteration:
  - Copy the weights and convert the copy to an FP16 model.
  - Forward pass (with the FP16 model parameters).
  - Scale the loss by a factor s.
  - Backward pass, i.e. compute the gradients (FP16 model parameters and parameter gradients).
  - Check whether any parameter gradient contains inf or nan. If so, reduce s, skip the update, and go back to step 1.
  - Otherwise, multiply the gradients by 1/s.
  - Update the FP32 model parameters with the FP16 gradients.
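The following is a minimal hand-written sketch of the dynamic scaling loop described above (illustrative only; it assumes model, optimizer, loss_fn, and data are defined as in the other examples, omits the FP16 weight copy, and only shows how s evolves — GradScaler does all of this automatically):

```python
import torch

s = 2.0 ** 16            # initial scale factor
growth_factor = 2.0      # factor applied when s grows
backoff_factor = 0.5     # factor applied when overflow is detected
growth_interval = 2000   # number of good steps between increases of s
good_steps = 0

for input, target in data:
    optimizer.zero_grad()
    loss = loss_fn(model(input), target)
    (loss * s).backward()                     # backward on the scaled loss

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(bool(torch.isinf(g).any() or torch.isnan(g).any()) for g in grads):
        s *= backoff_factor                   # overflow: shrink s and skip the update
        good_steps = 0
        continue

    for g in grads:
        g.mul_(1.0 / s)                       # unscale the gradients
    optimizer.step()                          # update the master weights

    good_steps += 1
    if good_steps % growth_interval == 0:
        s *= growth_factor                    # grow s after enough good steps
```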
The basic usage of mixed precision training is as follows:
```python
from torch.cuda.amp import autocast, GradScaler

# amp relies on Tensor Cores, so the model parameters must be CUDA tensors
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# the GradScaler object performs gradient scaling automatically
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # run the forward pass inside the autocast-enabled region
        with autocast():
            # the model makes an FP16 copy for the forward pass
            output = model(input)
            loss = loss_fn(output, target)

        # scale the loss (FP16) and backward to get scaled FP16 gradients
        scaler.scale(loss).backward()

        # step() unscales the gradients automatically before updating;
        # if nan or inf gradients are found, the step is skipped
        scaler.step(optimizer)

        # update the scale factor
        scaler.update()
```

Each part is explained in detail below.
3 autocast
class autocast(device_type, enabled=True, **kwargs)
- [device_type] (string): whether to use the 'cuda' or the 'cpu' device
- [enabled] (bool, default True): whether autocasting (automatic type conversion) is enabled inside the region
- [dtype] (torch.dtype): whether to use torch.float16 or torch.bfloat16
- [cache_enabled] (bool, default True): whether the weight cache inside autocast is enabled
Notes:

- An autocast instance can be used as a context manager or as a decorator to make a region of code run with mixed precision.
3.1 autocast operators
In PyTorch, inside an autocast region some operators are automatically run in FP16. Only CUDA operators are eligible for autocasting.
- Operators that amp automatically casts to FP16:
- Operators that are automatically cast to FP32:
- Operators not listed above, such as dot, add, cat, etc., run in the widest input dtype: if any input is FP32, the computation is done in FP32; if all inputs are FP16, it is done in FP16 (the small check below illustrates the dtype behaviour).
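A quick way to observe this (an illustrative check, not from the original article; it assumes a CUDA device): torch.mm is on the FP16 list and torch.sum is on the FP32 list, so their outputs have different dtypes inside autocast.

```python
import torch
from torch.cuda.amp import autocast

a = torch.rand((8, 8), device="cuda")   # default dtype: torch.float32
b = torch.rand((8, 8), device="cuda")

with autocast():
    c = torch.mm(a, b)     # mm is on the FP16 list
    d = torch.sum(c)       # sum is on the FP32 list

print(c.dtype)   # torch.float16
print(d.dtype)   # torch.float32
```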
3.2 Explicitly converting precision
When entering an autocast-enabled region, tensors may be of any type; you should not call half() or bfloat16() on the model or its inputs when using autocast.
However, when autocast is used as a context manager, FP16 variables produced inside the enabled region must be explicitly converted back to FP32 when they are used outside the region; otherwise type-mismatch errors may occur.

```python
# Create tensors in the default dtype (assumed to be FP32 here)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():
    # torch.mm is on the autocast FP16 operator list, so it is cast to FP16.
    # The inputs are FP32, but the computation runs in FP16 and the output is FP16.
    # No manual configuration is needed.
    e_float16 = torch.mm(a_float32, b_float32)
    # mixed input types are also allowed
    f_float16 = torch.mm(d_float32, e_float16)

# But after exiting autocast, FP16 variables produced inside the region
# must be explicitly converted to FP32.
g_float32 = torch.mm(d_float32, f_float16.float())
```
autocast regions can also be nested:

```python
# Create tensors in the default dtype (assumed to be FP32 here)
a_float32 = torch.rand((8, 8), device="cuda")
b_float32 = torch.rand((8, 8), device="cuda")
c_float32 = torch.rand((8, 8), device="cuda")
d_float32 = torch.rand((8, 8), device="cuda")

with autocast():
    e_float16 = torch.mm(a_float32, b_float32)
    with autocast(enabled=False):
        # autocast is disabled here, so the FP16 tensor must be cast back to FP32
        f_float32 = torch.mm(c_float32, e_float16.float())
    g_float16 = torch.mm(d_float32, f_float32)
```

3.3 autocast as a decorator
This usage is mainly needed for data-parallel training. autocast is designed to be thread local, so setting the autocast region only in the main thread does not work:
The usual call for ordinary (single-process) training is:
```python
model = MyModel()
with autocast():
    output = model(input)
```
Multi-GPU training uses nn.DataParallel or nn.DistributedDataParallel. Wrapping the model and then adding the autocast region as follows does not work, because this autocast only takes effect in the main thread:

```python
model = MyModel()
DP_model = nn.DataParallel(model)   # added

with autocast():
    output = DP_model(input)
```
For autocast to also take effect in the worker threads, the autocast region must be set inside forward as well. There are two ways: add a decorator, or add a context manager.

```python
# Option 1: decorator
class MyModel(nn.Module):
    @autocast()
    def forward(self, input):
        pass

# Option 2: context manager
class MyModel(nn.Module):
    def forward(self, input):
        with autocast():
            pass

# In the main function
model = MyModel()
DP_model = nn.DataParallel(model)   # added

with autocast():
    output = DP_model(input)
```
4 The GradScaler class
During mixed precision training, the model may fail to converge because the activation gradient values are too small and underflow. torch.cuda.amp.GradScaler addresses this by scaling the loss value up to prevent gradient underflow.

```python
torch.cuda.amp.GradScaler(init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5,
                          growth_interval=2000, enabled=True)
```
- 【init_scale】 the initial value of the scale factor
- 【growth_factor】 the factor by which the scale factor is multiplied when it grows
- 【backoff_factor】 the factor by which the scale factor is multiplied when it shrinks
- 【growth_interval】 the number of consecutive unskipped iterations between scale-factor increases
- 【enabled】 whether scaling is performed at all
4.1 GradScaler methods
scale(outputs) method
Multiplies outputs by the scale factor and returns the result. If enabled=False, outputs is returned unchanged.

step(optimizer, *args, **kwargs) method
Does two things: unscales the gradients, and checks them for overflow. If there is no nan/inf, optimizer.step() is executed; otherwise the step is skipped.

update(new_scale=None) method
Must be called at the end of each iteration. If the parameter update was skipped, the scale factor is multiplied by backoff_factor; once growth_interval unskipped iterations have accumulated, the scale factor is multiplied by growth_factor. new_scale can also be passed to set the scale factor directly. A small demo of the skip-and-backoff behaviour follows.
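Here is a small illustrative demo of that behaviour (not from the original article; it requires a CUDA device, and the deliberately injected inf is only there to trigger the skip):

```python
import torch
from torch.cuda.amp import GradScaler

scaler = GradScaler(init_scale=2.**16, backoff_factor=0.5)
param = torch.nn.Parameter(torch.ones(4, device="cuda"))
opt = torch.optim.SGD([param], lr=0.1)

# Build a loss whose gradients are inf, to force an overflow
loss = (param * torch.tensor(float("inf"), device="cuda")).sum()
scaler.scale(loss).backward()

print(scaler.get_scale())   # 65536.0
scaler.step(opt)            # update is skipped because the gradients contain inf
scaler.update()             # scale factor is multiplied by backoff_factor
print(scaler.get_scale())   # 32768.0
print(param)                # unchanged: all ones
```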
Example:

```python
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)
scaler = GradScaler()   # instantiate a GradScaler object before training

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():               # enable autocast for the forward pass
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()  # backward on the scaled loss

        # scaler.step() first unscales the gradients; if they contain no inf or
        # NaN it calls optimizer.step() to update the weights, otherwise the
        # step call is skipped so the weights stay untouched.
        scaler.step(optimizer)
        scaler.update()                # decide whether to adjust the scale factor
```

4.2 More uses of GradScaler in gradient handling
Gradient clipping
```python
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # unscale the gradients first so that the clip threshold is applied
        # to the true gradient values
        scaler.unscale_(optimizer)

        # clip the gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

        # since unscale_() was already called explicitly, scaler.step() will not
        # unscale again; it still skips the update if nan/inf gradients are found
        scaler.step(optimizer)
        scaler.update()
```
Gradient accumulation

```python
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            # normalize the loss by the number of accumulation steps
            loss = loss / iters_to_accumulate

        # scale the normalized loss and backward
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired
            # (e.g., to allow clipping unscaled gradients)

            # step() and update() proceed as usual
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```
Gradient penalty

```python
for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)

        # to avoid underflow, compute the scaled gradients from the scaled loss
        # outside the autocast region
        scaled_grad_params = torch.autograd.grad(outputs=scaler.scale(loss),
                                                 inputs=model.parameters(),
                                                 create_graph=True)

        # unscale the gradients
        inv_scale = 1. / scaler.get_scale()
        grad_params = [p * inv_scale for p in scaled_grad_params]

        # inside the autocast region, add the gradient penalty term to the loss
        with autocast():
            grad_norm = 0
            for grad in grad_params:
                grad_norm += grad.pow(2).sum()
            grad_norm = grad_norm.sqrt()
            loss = loss + grad_norm

        scaler.scale(loss).backward()

        # may unscale_ here if desired
        # (e.g., to allow clipping unscaled gradients)

        # step() and update() proceed as usual
        scaler.step(optimizer)
        scaler.update()
```

4.5 Multiple models
A single scaler can serve multiple models and optimizers, but scale(loss) and step(optimizer) must be called separately for each loss and each optimizer.
```python
scaler = torch.cuda.amp.GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer0.zero_grad()
        optimizer1.zero_grad()
        with autocast():
            output0 = model0(input)
            output1 = model1(input)
            loss0 = loss_fn(2 * output0 + 3 * output1, target)
            loss1 = loss_fn(3 * output0 - 5 * output1, target)

        # retain_graph is unrelated to amp; it is needed here because the two
        # backward() calls share parts of the graph
        scaler.scale(loss0).backward(retain_graph=True)
        scaler.scale(loss1).backward()

        # to inspect or modify the gradients owned by a particular optimizer,
        # explicitly unscale them for that optimizer
        scaler.unscale_(optimizer0)

        scaler.step(optimizer0)
        scaler.step(optimizer1)
        scaler.update()
```
5 Notes on mixed precision
- Use amp on GPUs with Tensor Core support whenever possible. On a GPU without Tensor Cores, amp still reduces memory usage significantly, but the speed drops considerably: for example, on a Turing-architecture GTX 1660, amp roughly doubles the computation time while memory usage falls to less than half.
- Constant ranges: to keep computations from overflowing, first make sure manually set constants (such as epsilon, INF, etc.) do not overflow in FP16.
- Dimensions are best kept as multiples of 8: performance is best when dimensions are multiples of 8 (a tiny rounding helper is sketched below).
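As a tiny illustrative helper (not from the original article), rounding a dimension up to the nearest multiple of 8 can be done like this:

```python
def round_up_to_multiple_of_8(n: int) -> int:
    # round n up to the nearest multiple of 8
    return ((n + 7) // 8) * 8

print(round_up_to_multiple_of_8(50))   # 56
print(round_up_to_multiple_of_8(64))   # 64
```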