当前位置:网站首页>Pytorch model trains practical skills and breaks through the bottleneck of speed
Pytorch model trains practical skills and breaks through the bottleneck of speed
2022-07-07 14:31:00 【Xiaobai learns vision】
Click on the above “ Xiaobai studies vision ”, Optional plus " Star standard " or “ Roof placement ”
Heavy dry goods , First time delivery
Reading guide
One step by step Guide to , Very practical .
Don't let your neural network become like thisLet's face it , Your model may still be in the stone age . I bet you still use 32 Bit accuracy or GASP Even in one GPU Training .
I understand, , There are all kinds of neural network acceleration guides on the Internet , But one checklist None ( There is now a ), Use this list , Step by step, make sure you squeeze all the performance out of your model .
This guide ranges from the simplest structure to the most complex changes , Can make your network get the most benefit . I'll show you an example Pytorch Code and can be in Pytorch- lightning Trainer Related to the use of flags, So you don't have to write the code yourself !
** Who is this guide for ?** Any use Pytorch People who do in-depth learning model research , Like researchers 、 Doctor 、 Scholars, etc , The model we're talking about here may take you a few days of training , Even weeks or months .
We will talk about :
Use DataLoaders
DataLoader Medium workers Number
Batch size
Gradient accumulation
Reserved calculation chart
Move to a single
16-bit Mixed precision training
Move to multiple GPUs in ( Model replication )
Move to multiple GPU-nodes in (8+GPUs)
Thinking model acceleration techniques
Pytorch-Lightning
You can Pytorch The library of Pytorch- lightning Find every optimization I've discussed here .Lightning Is in Pytorch A package above , It can train automatically , At the same time, it gives researchers complete control over key model components .Lightning Use the latest best practices , And minimize where you might go wrong .
We are MNIST Definition LightningModel And use Trainer To train the model .
from pytorch_lightning import Trainer
model = LightningModule(…)
trainer = Trainer()
trainer.fit(model)
1. DataLoaders
This is probably the easiest place to get speed gain . preservation h5py or numpy The era of files to speed up data loading is gone , Use Pytorch dataloader Loading image data is simple ( about NLP data , Please check out TorchText).
stay lightning in , You don't need to specify a training cycle , Just define dataLoaders and Trainer They will be called when they need to .
dataset = MNIST(root=self.hparams.data_root, train=train, download=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
x, y = batch
model.training_step(x, y)
...
2. DataLoaders Medium workers The number of
Another amazing thing about acceleration is that it allows batch parallel loading . therefore , You can load nb_workers individual batch, Instead of loading one at a time batch.
# slow
loader = DataLoader(dataset, batch_size=32, shuffle=True)
# fast (use 10 workers)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=10)
3. Batch size
Before you start the next optimization step , take batch size Increase to CPU-RAM or GPU-RAM The maximum range allowed .
The next section focuses on how to help reduce memory usage , So that you can continue to add batch size.
remember , You may need to update your learning rate again . A good rule of thumb is , If batch size double , Then the learning rate will double .
4. Gradient accumulation
When you've reached the computing resource limit , Yours batch size Still too small ( such as 8), Then we need to simulate a larger batch size To make a gradient descent , To provide a good estimate .
Suppose we want to achieve 128 Of batch size size . We need to batch size by 8 perform 16 Forward and backward , Then perform the optimization step again .
# clear last step
optimizer.zero_grad()
# 16 accumulated gradient steps
scaled_loss = 0
for accumulated_step_i in range(16):
out = model.forward()
loss = some_loss(out,y)
loss.backward()
scaled_loss += loss.item()
# update weights after 8 steps. effective batch = 8*16
optimizer.step()
# loss is now scaled up by the number of accumulated batches
actual_loss = scaled_loss / 16
stay lightning in , It's all done for you , Just set up accumulate_grad_batches=16
:
trainer = Trainer(accumulate_grad_batches=16)
trainer.fit(model)
5. Reserved calculation chart
One of the easiest ways to burst your memory is to log your memory loss.
losses = []
...
losses.append(loss)
print(f'current loss: {torch.mean(losses)'})
The question above is ,loss Still contains copies of the entire diagram . under these circumstances , call .item() To release it .
![1_CER3v8cok2UOBNsmnBrzPQ](9 Tips For Training Lightning-Fast Neural Networks In Pytorch.assets/1_CER3v8cok2UOBNsmnBrzPQ.gif)# bad
losses.append(loss)
# good
losses.append(loss.item())
Lightning Will be very careful , Make sure that copies of the calculation diagram are not retained .
6. Single GPU Training
Once you have completed the previous steps , It's time to enter GPU Trained . stay GPU Training on will make multiple GPU cores The mathematical computation between is parallelized . The acceleration you get depends on what you use GPU type . I recommend personal use 2080Ti, Company use V100.
At first glance , It can be overwhelming , But you really just need to do two things :1) Move your model to GPU, 2) Whenever you run data through it , Put the data in GPU On .
# put model on GPU
model.cuda(0)
# put data on gpu (cuda on a variable returns a cuda copy)
x = x.cuda(0)
# runs on GPU now
model(x)
If you use Lightning, You don't have to do anything , Just set up Trainer(gpus=1)
.
# ask lightning to use gpu 0 for training
trainer = Trainer(gpus=[0])
trainer.fit(model)
stay GPU When you're training on , The main thing to be aware of is the limitation CPU and GPU The number of transfers between .
# expensive
x = x.cuda(0)# very expensive
x = x.cpu()
x = x.cuda(0)
If memory runs out , Don't move data back to CPU To save memory . Asking for help GPU Before , Try to optimize your code in other ways or GPU Between the memory distribution .
Another thing to note is call coercion GPU Synchronous operation . Clearing memory cache is an example .
# really bad idea. Stops all the GPUs until they all catch up
torch.cuda.empty_cache()
however , If you use Lightning, The only thing that can go wrong is in defining Lightning Module when .Lightning You will pay special attention not to make such mistakes .
7. 16-bit precision
16bit Precision is an amazing technique for halving memory usage . Most models use 32bit Precision figures for training . However , Recent research has found ,16bit Models can also work well . Mixed precision means using 16bit, But keep the weight and so on in 32bit.
To be in Pytorch Use in 16bit precision , Please install NVIDIA Of apex library , And make these changes to your model .
# enable 16-bit on the model and the optimizer
model, optimizers = amp.initialize(model, optimizers, opt_level='O2')
# when doing .backward, let amp do it so it can scale the loss
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
amp The bag will handle most of the things . If the gradient explodes or tends to 0, It even zooms loss.
stay lightning in , Enable 16bit There is no need to modify anything in the model , You don't need to do what I wrote above . Set up Trainer(precision=16)
That's all right. .
trainer = Trainer(amp_level='O2', use_amp=False)
trainer.fit(model)
8. Move to multiple GPUs in
Now? , It became very interesting . Yes 3 Kind of ( Maybe more ?) Methods to do more GPU Training .
branch batch Training
A) Copy the model to each GPU in ,B) For each GPU Part of the batchThe first method is called “ branch batch Training ”. This policy copies the model to each GPU On , Every GPU get batch Part of .
# copy model on each GPU and give a fourth of the batch to each
model = DataParallel(model, devices=[0, 1, 2 ,3])
# out has 4 outputs (one for each gpu)
out = model(x.cuda(0))
stay lightning in , You just need to add GPUs The number of , Then tell trainer, Nothing else to do .
# ask lightning to use 4 GPUs for training
trainer = Trainer(gpus=[0, 1, 2, 3])
trainer.fit(model)
Model distribution training
Put different parts of the model in different GPU On ,batch Move in orderSometimes your model may be too big to fit completely into memory . for example , Sequence to sequence models with encoders and decoders may take up 20GB RAM. In this case , We want to put the encoder and decoder in separate GPU On .
# each model is sooo big we can't fit both in memory
encoder_rnn.cuda(0)
decoder_rnn.cuda(1)
# run input through encoder on GPU 0
encoder_out = encoder_rnn(x.cuda(0))
# run output through decoder on the next GPU
out = decoder_rnn(encoder_out.cuda(1))
# normally we want to bring all outputs back to GPU 0
out = out.cuda(0)
For this type of training , stay Lightning You don't need to specify any GPU, You should put LightningModule Put the module in the correct GPU On .
class MyModule(LightningModule):
def __init__():
self.encoder = RNN(...)
self.decoder = RNN(...)
def forward(x):
# models won't be moved after the first forward because
# they are already on the correct GPUs
self.encoder.cuda(0)
self.decoder.cuda(1)
out = self.encoder(x)
out = self.decoder(out.cuda(1))
# don't pass GPUs to trainer
model = MyModule()
trainer = Trainer()
trainer.fit(model)
A mixture of the two
In the case above , Coders and decoders can still benefit from parallelization operations .
# change these lines
self.encoder = RNN(...)
self.decoder = RNN(...)
# to these
# now each RNN is based on a different gpu set
self.encoder = DataParallel(self.encoder, devices=[0, 1, 2, 3])
self.decoder = DataParallel(self.encoder, devices=[4, 5, 6, 7])
# in forward...
out = self.encoder(x.cuda(0))
# notice inputs on first gpu in device
sout = self.decoder(out.cuda(4)) # <--- the 4 here
The use of multiple GPU What to consider when :
If the model is already in GPU Yes ,model.cuda() Will not do anything .
Always put input on the first device in the device list .
It's expensive to transfer data between devices , Use it as a last resort .
Optimizer and gradient will be saved in GPU 0 On , therefore ,GPU 0 The memory used on the GPU Much more .
9. multi-node GPU Training
Every... On every machine GPU There's a copy of the model . Each machine gets part of the data , And only training in that part . Every machine can synchronize gradients .
If you've done that , So you can now train in a few minutes Imagenet 了 ! It's not as hard as you think , But it may require you to know more about computing clusters . These instructions assume that you are using SLURM.
Pytorch Allow multi node training , By copying each... On each node GPU On the model and synchronize the gradient . therefore , Every model is in every GPU Independently initialized on , Essentially train independently on a partition of the data , Except that they all receive gradient updates from all models .
At a high level :
At every GPU Initializes a copy of a model on ( Make sure to set the seed , Let each model be initialized to the same weight , Otherwise it will fail ).
Divide the data set into subsets ( Use DistributedSampler). Every GPU Only train on its own kid set .
stay .backward() On , All copies receive gradient copies of all models . This is the only communication between models .
Pytorch There's a good abstraction , be called DistributedDataParallel, It can help you achieve this function . To use DDP, You need to do 4 Things about :
def tng_dataloader():
d = MNIST()
# 4: Add distributed sampler
# sampler sends a portion of tng data to each machine
dist_sampler = DistributedSampler(dataset)
dataloader = DataLoader(d, shuffle=False, sampler=dist_sampler)
def main_process_entrypoint(gpu_nb):
# 2: set up connections between all gpus across all machines
# all gpus connect to a single GPU "root"
# the default uses env://
world = nb_gpus * nb_nodes
dist.init_process_group("nccl", rank=gpu_nb, world_size=world)
# 3: wrap model in DPP
torch.cuda.set_device(gpu_nb)
model.cuda(gpu_nb)
model = DistributedDataParallel(model, device_ids=[gpu_nb])
# train your model now...
if __name__ == '__main__':
# 1: spawn number of processes
# your cluster will call main for each machine
mp.spawn(main_process_entrypoint, nprocs=8)
However , stay Lightning in , Just set the number of nodes , It will deal with the rest of the things for you .
# train on 1024 gpus across 128 nodes
trainer = Trainer(nb_gpu_nodes=128, gpus=[0, 1, 2, 3, 4, 5, 6, 7])
Lightning And it comes with a SlurmCluster Manager , It can help you submit SLURM The correct details of the assignment .
10. welfare ! More on a single node GPU Faster training
The fact proved that ,distributedDataParallel Than DataParallel Much faster , Because it only performs gradient synchronous communication . therefore , A good hack It's using distributedDataParallel Replace DataParallel, Even if you train on a single machine .
stay Lightning in , It's easy to get distributed_backend Set to ddp And set up GPUs To achieve .
# train on 4 gpus on the same machine MUCH faster than DataParallel
trainer = Trainer(distributed_backend='ddp', gpus=[0, 1, 2, 3])
Thinking about model acceleration
Although this guide will provide you with a series of tips for improving network speed , But I'm going to explain to you how to think through bottlenecks .
I divided the model into several parts :
First , I want to make sure there are no bottlenecks in data loading . So , I used the existing data loading solution I described , But if you don't have a solution that meets your needs , Consider offline processing and caching to high-performance data storage , such as h5py.
Next, let's see what you're going to do in the training steps . Make sure you move forward fast , Avoid too much computation and minimization CPU and GPU Data transfer between . Last , Avoid doing something that will reduce GPU The speed thing ( This guide introduces ).
Next , I try to maximize my batch size, This is usually due to GPU Memory size limit . Now? , Need to focus on using big batch size How to be in multiple GPUs Distribute and minimize latency ( such as , I might try to be in more than one gpu Upper use 8000 + Effective batch size).
However , You need to be careful with big batch size. For your specific problem , Please refer to the relevant literature , Look at what people are ignoring !
The good news !
Xiaobai learns visual knowledge about the planet
Open to the outside world
download 1:OpenCV-Contrib Chinese version of extension module
stay 「 Xiaobai studies vision 」 Official account back office reply : Extension module Chinese course , You can download the first copy of the whole network OpenCV Extension module tutorial Chinese version , Cover expansion module installation 、SFM Algorithm 、 Stereo vision 、 Target tracking 、 Biological vision 、 Super resolution processing and other more than 20 chapters .
download 2:Python Visual combat project 52 speak
stay 「 Xiaobai studies vision 」 Official account back office reply :Python Visual combat project , You can download, including image segmentation 、 Mask detection 、 Lane line detection 、 Vehicle count 、 Add Eyeliner 、 License plate recognition 、 Character recognition 、 Emotional tests 、 Text content extraction 、 Face recognition, etc 31 A visual combat project , Help fast school computer vision .
download 3:OpenCV Actual project 20 speak
stay 「 Xiaobai studies vision 」 Official account back office reply :OpenCV Actual project 20 speak , You can download the 20 Based on OpenCV Realization 20 A real project , Realization OpenCV Learn advanced .
Communication group
Welcome to join the official account reader group to communicate with your colleagues , There are SLAM、 3 d visual 、 sensor 、 Autopilot 、 Computational photography 、 testing 、 Division 、 distinguish 、 Medical imaging 、GAN、 Wechat groups such as algorithm competition ( It will be subdivided gradually in the future ), Please scan the following micro signal clustering , remarks :” nickname + School / company + Research direction “, for example :” Zhang San + Shanghai Jiaotong University + Vision SLAM“. Please note... According to the format , Otherwise, it will not pass . After successful addition, they will be invited to relevant wechat groups according to the research direction . Please do not send ads in the group , Or you'll be invited out , Thanks for your understanding ~
边栏推荐
- Substance Painter笔记:多显示器且多分辨率显示器时的设置
- Beginner JSP
- Internal sort - insert sort
- Data flow diagram, data dictionary
- 【愚公系列】2022年7月 Go教学课程 005-变量
- Hangdian oj2092 integer solution
- 设备故障预测机床故障提前预警机械设备振动监测机床故障预警CNC震动无线监控设备异常提前预警
- Similarities and differences between switches and routers
- KITTI数据集简介与使用
- Huawei cloud database DDS products are deeply enabled
猜你喜欢
UML state diagram
Démontage de la fonction du système multi - Merchant Mall 01 - architecture du produit
GVIM [III] [u vimrc configuration]
gvim【三】【_vimrc配置】
Huawei cloud database DDS products are deeply enabled
今日睡眠质量记录78分
Because the employee set the password to "123456", amd stolen 450gb data?
Docker deploy Oracle
PERT图(工程网络图)
华为云数据库DDS产品深度赋能
随机推荐
ndk初学习(一)
云上“视界” 创新无限 | 2022阿里云直播峰会正式上线
AWS learning notes (III)
Million data document access of course design
NDK beginner's study (1)
Leetcode——344. Reverse string /541 Invert string ii/151 Reverse the word / Sword finger in the string offer 58 - ii Rotate string left
内部排序——插入排序
Introduction to sakt method
C # switch pages through frame and page
Data connection mode in low code platform (Part 2)
JS in the browser Base64, URL, blob mutual conversion
Cocos creator direction and angle conversion
Attribute keywords ondelete, private, readonly, required
属性关键字ServerOnly,SqlColumnNumber,SqlComputeCode,SqlComputed
【历史上的今天】7 月 7 日:C# 发布;Chrome OS 问世;《仙剑奇侠传》发行
Notes de l'imprimante substance: paramètres pour les affichages Multi - écrans et multi - Résolutions
PyTorch模型训练实战技巧,突破速度瓶颈
Excuse me, does PTS have a good plan for database pressure measurement?
半小时『直播连麦搭建』动手实战,大学生技术岗位简历加分项get!
LeetCode每日一题(636. Exclusive Time of Functions)