[PyTorch notes] Distributed model training with DataParallel and DistributedDataParallel
2022-07-01 19:19:00 【magic_ ll】
When training a neural network on multiple GPUs, PyTorch provides two APIs for running the model across them: torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel.
The latter has many advantages. These notes record the differences between the two.
!!! Procrastination needs to be overcome
DataParallel API
gpus = [0, 1]
torch.nn.DataParallel(model.cuda(), device_ids=gpus, output_device=gpus[0])
Parameters: 【module】 the model to parallelize. 【device_ids】 the indices of the GPUs used for training. 【output_device】 the device that holds the output; it gathers the results from every GPU and performs the remaining computation. The default is gpus[0].
The parallel processing mechanism of DataParallel is:
- [Main GPU] distributes the data and the model
Data is read from disk into page-locked (pinned) host memory, transferred to the main GPU's memory, and then split along the batch dimension and assigned to each GPU. The model is loaded onto the main GPU and then copied to the other GPUs.
- Forward on every GPU
Each GPU runs the forward pass on its own share of the data, in a separate thread, to compute its output.
- [Main GPU] gathers the network outputs, computes the loss, distributes the loss
The main GPU gathers the outputs from every GPU, computes the loss by comparing the outputs with the ground-truth label of each element in the batch, and then distributes the loss to each GPU.
- Backward on every GPU
Each GPU runs back-propagation and computes its gradients.
- [Main GPU] sums the gradients, updates the weights, synchronizes the weights to the other GPUs
All gradients are gathered and summed on the main GPU, the weights are updated by gradient descent, and the updated weights are distributed to every GPU.
Other notes
DataParallel uses a single controlling process that loads the model and the data onto multiple GPUs. It can be seen as copying the trainable parameters from the main GPU to the other GPUs. Every GPU is responsible for forward and backward (it only computes gradients and does not update weights). The main GPU has additional responsibilities: gathering a copy of every GPU's output, computing the loss, summing the gradients, updating the weights, copying the weights to every GPU, and redundant data copies (data is read from disk to the main GPU and then split among the other GPUs).
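The division of labor described above can be imitated on the CPU with a toy one-weight model. This is only a sketch for intuition, not how PyTorch implements DataParallel: the names `scatter`, `grad_sum` and `data_parallel_step` are made up for illustration, and the loss computation is folded into each replica's gradient for brevity.

```python
def grad_sum(w, xs, ys):
    """Sum of d/dw (w*x - y)^2 over one chunk of samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def scatter(items, n_replicas):
    """Split a batch into contiguous chunks, one per replica."""
    size = (len(items) + n_replicas - 1) // n_replicas
    return [items[i:i + size] for i in range(0, len(items), size)]


def data_parallel_step(w, xs, ys, n_replicas, lr):
    # The "main device" splits the batch and hands every replica a copy of w.
    x_chunks, y_chunks = scatter(xs, n_replicas), scatter(ys, n_replicas)
    replicas = [w] * len(x_chunks)
    # Every replica computes a gradient on its own chunk only.
    grads = [grad_sum(rw, cx, cy)
             for rw, cx, cy in zip(replicas, x_chunks, y_chunks)]
    # The "main device" sums the gradients and performs the single update.
    return w - lr * sum(grads) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_parallel = data_parallel_step(0.0, xs, ys, n_replicas=2, lr=0.01)
w_single = 0.0 - 0.01 * grad_sum(0.0, xs, ys) / len(xs)
print(w_parallel, w_single)
```

Both values come out the same, which is the point: splitting the batch changes where the work happens, not the update, while all of the gathering and updating lands on one device.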
This leads to the following problems:
- Unbalanced load: GPU 0 acts as the master and manages all of the data, so its memory usage and utilization are higher than the others'. Watching GPU usage with
watch -n 1 nvidia-smi
shows that the first GPU uses far more memory than the other GPUs.
- Communication becomes a bottleneck, and overall GPU utilization is low.
- Synchronized BN is not supported. Assuming batch_size=8 per card, training with two cards gives worse results than a single card with batch_size=16, because the BN statistics are computed over the batch on a single card (though the results are still better than a single card with batch_size=8).
- To train on specific cards, for example physical cards 1 and 2, the following settings are required:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number the GPUs from 0 in PCI bus order
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"  # expose only physical cards 1 and 2; they become '/gpu:0' and '/gpu:1', so card 1 is used first, then card 2
...
gpus = [0, 1]  # logical indices, corresponding to physical cards 1 and 2
net = torch.nn.DataParallel(net.cuda(), device_ids=gpus)  # output_device defaults to gpus[0] = 0, i.e. physical card 1
If the following settings are used instead, part of the work still runs on physical card 0, and if card 0 is already full the program will not run properly:
gpus = [1, 2]  # physical cards 1 and 2
net = torch.nn.DataParallel(net.cuda(), device_ids=gpus)  # output_device defaults to gpus[0] = 1, but net.cuda() still places the model on the default device, physical card 0
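The relationship between the logical indices passed as device_ids and the physical card numbers can be sketched as a small lookup. The helper `physical_gpu` below is hypothetical (it is not a PyTorch or CUDA function) and assumes that CUDA_VISIBLE_DEVICES contains plain integer indices and that CUDA_DEVICE_ORDER is PCI_BUS_ID.

```python
def physical_gpu(logical_index, visible_devices=None):
    """Map a logical GPU index to the physical card number.

    visible_devices is the CUDA_VISIBLE_DEVICES string, or None when the
    variable is unset (logical and physical numbering then coincide).
    """
    if visible_devices is None:
        return logical_index
    visible = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return visible[logical_index]


# With CUDA_VISIBLE_DEVICES="1,2", device_ids=[0, 1] means cards 1 and 2.
mapped = [physical_gpu(i, "1,2") for i in (0, 1)]
# Without the variable, device_ids=[1, 2] means cards 1 and 2, but the
# default device (logical 0) is still physical card 0.
default_card = physical_gpu(0)
print(mapped, default_card)
```

Under these assumptions the first configuration never touches card 0, while in the second one anything placed on the default device still lands there.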
Usage
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"  # must be set before importing torch
import torch
import torch.nn as nn
import torch.optim as optim

gpus = [0, 1]  # logical indices, i.e. physical cards 1 and 2
torch.cuda.set_device(gpus[0])

train_dataset = ...
train_loader = torch.utils.data.DataLoader(train_dataset, ...)

model = ...
model = nn.DataParallel(model.cuda(), device_ids=gpus)
criterion = ...
# Note: the optimizer and other network settings come after nn.DataParallel
optimizer = optim.SGD(model.parameters(), lr=...)

for epoch in range(1000):
    for batch_idx, (images, target) in enumerate(train_loader):
        images = images.cuda(non_blocking=True)
        target = target.cuda(non_blocking=True)
        ...
        output = model(images)
        loss = criterion(output, target)
        ...
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
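The BN limitation listed among the problems above can be made concrete with a small calculation. The sketch below compares the statistics each card sees on its own with the statistics of the whole batch, and shows one common way to recover the global values from per-card counts, sums and sums of squares. It illustrates the idea only; the helper names are made up, and it is not a description of how any PyTorch BN layer is implemented.

```python
def mean_var(values):
    """Mean and (biased) variance of a list of activations."""
    n = len(values)
    mean = sum(values) / n
    return mean, sum((v - mean) ** 2 for v in values) / n


def synced_mean_var(shards):
    """Global statistics rebuilt from per-card count, sum and sum of squares."""
    n = sum(len(s) for s in shards)
    total = sum(sum(s) for s in shards)
    total_sq = sum(sum(v * v for v in s) for s in shards)
    mean = total / n
    return mean, total_sq / n - mean ** 2


card0 = [1.0, 2.0, 3.0, 4.0]   # activations seen by the first card
card1 = [5.0, 6.0, 7.0, 8.0]   # activations seen by the second card

per_card = [mean_var(card0), mean_var(card1)]
global_stats = mean_var(card0 + card1)
synced = synced_mean_var([card0, card1])
print(per_card, global_stats, synced)
```

Each card on its own sees a variance of 1.25 around a different mean, while the full batch has mean 4.5 and variance 5.25, so per-card normalization is not the same as normalizing a single large batch.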
DistributedDataParallel API
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
Since version 1.0, PyTorch officially wraps the common distributed primitives and supports all-reduce, broadcast, send, receive and so on. CPU communication is implemented through MPI and GPU communication through NCCL.
It solves DataParallel's problems of slow speed and unbalanced GPU load.
How it works
- Independent data loading:
Each process loads its own data from disk into page-locked (pinned) host memory, using multiple worker processes to load in parallel, and transfers its own mini-batch from pinned memory to its GPU. No data broadcast is needed, and no per-iteration model broadcast either (every GPU holds an identical copy of the model).
The distributed sampler (DistributedSampler) ensures that the data loaded by different processes does not overlap.
- Forward and backward on every GPU
Each GPU runs the forward pass independently and computes the network output;
each GPU computes its loss independently and runs the backward pass to compute its gradients;
the gradients are then averaged across all processes with all-reduce, so every process ends up holding the same averaged gradient.
- Update model parameters
Each GPU updates its parameters independently using the same gradient. Because every GPU starts from an identical copy of the model, the initial parameters are the same and the gradient applied is the same, so the weight updates on all GPUs are identical. No model synchronization is therefore required.
Other notes
DDP uses multi-process control. Once the code is written, it is started as n processes, each running on its own GPU. There is no master GPU; every GPU performs the same task.
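Why the replicas never drift apart can be checked on a toy example. In the sketch below every "process" keeps its own copy of a single weight and only the gradients are averaged; `all_reduce_mean` is a hypothetical pure-Python stand-in for the collective gradient averaging, not a PyTorch call.

```python
def local_grad(w, xs, ys):
    """Mean gradient of (w*x - y)^2 over this process's own shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every process receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


def ddp_step(weights, shards, lr):
    # Each process computes a gradient on its own data with its own weight.
    grads = [local_grad(w, xs, ys) for w, (xs, ys) in zip(weights, shards)]
    # Only gradients are exchanged; every process gets the same average.
    synced = all_reduce_mean(grads)
    # Each process applies the update itself; no weights are ever sent.
    return [w - lr * g for w, g in zip(weights, synced)]


shards = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
weights = [0.0, 0.0]          # identical starting copies of the model
for _ in range(3):
    weights = ddp_step(weights, shards, lr=0.01)
print(weights)
```

Because the starting weights and the applied gradients are identical, the copies stay equal after every step without any weight broadcast.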
Unlike single-process training, multi-process training needs attention to the following points:
- Tell each process which GPU it uses (see torch.cuda.set_device(args.local_rank) in the code below)
- Initialize the distributed process group
- Distributed data loading
A complete batch is divided among the processes, and DistributedSampler ensures that the data loaded by different processes does not overlap. At the start of every training epoch, call train_sampler.set_epoch(epoch) so that the data is reshuffled.
- Build the model
Define the model and the loss, then wrap the network in DDP (in the code below, its BN layers are first converted to synchronized BN).
- Benefits of BN: training is normalized inside the network, which regularizes the training process, prevents covariate shift in the intermediate feature maps, and helps suppress overfitting. With BN there is less dependence on parameter initialization and a larger learning rate can be used, so training is faster.
- The Batch Normalization in the existing API is a single-card implementation: it normalizes over the samples on one card only. With multi-card training on 4 cards and a total batch_size of 32, the BN statistics are still computed over only 8 samples. If the batch size on a single card is small, this hurts the convergence of the model.
- Cross-card synchronized Batch Normalization normalizes over the global batch, which is what increasing batch_size by adding cards is really meant to achieve. Using cross-card BN noticeably improves experimental results.
- Optimizer settings
In each iteration, every process has its own optimizer and independently completes all optimization steps, just as in ordinary training. Each process corresponds to an independent training run, and only a small amount of data, such as gradients, is exchanged.
In DataParallel, a single optimizer is maintained for the whole process: the gradients are summed, the parameters are updated on the main GPU, and the updated parameters are then broadcast to the other GPUs. By comparison, the former transfers less data, so it is faster and more efficient.
- Training
- Recording the loss
With multiple processes, each process computes its own loss. When logging, the losses of the different processes should be averaged, and the same goes for other metrics. The API to use is shown below; see the source code for details:
def all_reduce(tensor, op=ReduceOp.SUM, group=group.WORLD, async_op=False):
    """ Reduces the tensor data across all machines in such a way that all get the final result. """
- Saving the model
With DDP, the model is replicated on every GPU and wrapped in an extra layer. When saving, it is therefore enough to save the master node's model, using model.module in place of the usual model, as follows:
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "{}.ckpt".format(str(epoch)))
When loading, the weights only need to be loaded on the master node, before the DDP model is constructed:
if dist.get_rank() == 0 and ckpt_path is not None:
    model.load_state_dict(torch.load(ckpt_path))
- Each process has its own interpreter and GIL
The most widely used Python interpreter is CPython, the implementation of Python written in C. Its Global Interpreter Lock (GIL) is the mechanism Python uses to synchronize threads: only one thread can execute at any moment, which makes Python perform poorly with multithreading.
Because each process has its own interpreter and GIL, the extra interpreter overhead and GIL contention that come from driving multiple execution threads, model replicas or GPUs from a single Python process are eliminated. This is particularly important for models that rely heavily on the Python runtime, for example models with RNN layers or a large number of small components.
Usage: writing the code
import os
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist

def parse():
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()
    return args

def reduce_tensor(tensor):
    # Average a tensor over all processes
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= dist.get_world_size()
    return rt

def record_loss(loss):
    # Return the loss averaged over all processes as a Python float
    return reduce_tensor(loss.detach()).item()

def main():
    # After the launcher starts the Python script, it tells each process which
    # GPU to use through the local_rank argument, so that every process is
    # bound to a different device.
    args = parse()
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(
        'nccl',                # GPU communication backend (NCCL)
        init_method='env://'   # read the connection parameters from environment variables
    )

    # Distributed data loading; for details see
    # https://blog.csdn.net/magic_ll/article/details/123294552
    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

    # Distributed model, including SyncBN
    model = ...
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
    criterion = ...
    optimizer = optim.SGD(model.parameters(), lr=...)

    # Training
    for epoch in range(100):
        train_sampler.set_epoch(epoch)
        train_epoch_loss = 0.0
        for batch_idx, (images, target) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            output = model(images)
            loss = criterion(output, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_epoch_loss += record_loss(loss)
        # When writing to TensorBoard, letting a single process write is enough
        # (writer and val_epoch_loss are assumed to be set up elsewhere)
        if args.local_rank == 0:
            writer.add_scalars('Loss/training', {
                'train_loss': train_epoch_loss,
                'val_loss': val_epoch_loss}, epoch + 1)
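The roles of DistributedSampler and set_epoch in the listing above can be illustrated without PyTorch. The function below is a simplified imitation written for this note: `shard_indices` is a made-up name, and it shuffles with Python's random module, so the actual order differs from what torch.utils.data.distributed.DistributedSampler produces. It only shows the general scheme of shuffling with an epoch-dependent seed, padding, and giving each rank every num_replicas-th index.

```python
import random


def shard_indices(dataset_len, num_replicas, rank, epoch, seed=0):
    """Indices one process would load in a given epoch (simplified sketch)."""
    indices = list(range(dataset_len))
    # Every rank uses the same seed, so all ranks agree on the shuffled order;
    # mixing in the epoch changes the order from epoch to epoch.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating the first few indices so the total divides evenly.
    per_rank = -(-dataset_len // num_replicas)   # ceiling division
    total = per_rank * num_replicas
    indices += indices[: total - len(indices)]
    # Rank r takes positions r, r + num_replicas, r + 2 * num_replicas, ...
    return indices[rank:total:num_replicas]


epoch0 = [shard_indices(8, 4, rank, epoch=0) for rank in range(4)]
epoch1 = [shard_indices(8, 4, rank, epoch=1) for rank in range(4)]
print(epoch0)
print(epoch1)
```

When the dataset size divides evenly among the processes, as here, the shards of one epoch are disjoint and together cover the whole dataset. When it does not, the padding means a few samples are seen by two processes.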
Launching the code
- For multi-process startup there is no need to write your own multiprocessing code to handle the complex CPU/GPU assignment. PyTorch provides a convenient launcher, torch.distributed.launch, so the training code is run as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
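The launcher and init_method='env://' meet through environment variables: the launcher exports MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK for every process it starts, and the env:// initialization reads them. The sketch below only shows what a process can see; `read_dist_env` is a hypothetical helper, and the fallback values are assumptions chosen for running a single process by hand, not something the training script above relies on.

```python
import os


def read_dist_env(env=os.environ):
    """Collect the rendezvous settings that init_method='env://' relies on."""
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),  # assumed fallback
        "master_port": int(env.get("MASTER_PORT", "29500")),  # assumed fallback
        "world_size": int(env.get("WORLD_SIZE", "1")),        # total process count
        "rank": int(env.get("RANK", "0")),                    # global index of this process
    }


# What the third of four processes on one node might see (example values).
example = read_dist_env({
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "4",
    "RANK": "2",
})
print(example)
```

Printing this dictionary at the top of main() is a quick way to confirm that every process was started with the rank and world size you expect.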