PyTorch distributed training
2022-07-27 21:53:00 【Luo Lei】
The era of training models on a single machine with a single GPU is over; a single machine with multiple GPUs is now the mainstream setup. How do you get the most out of multiple GPUs? This article introduces PyTorch's DistributedDataParallel.
1. DataParallel
PyTorch has actually long shipped a data-parallel tool, DataParallel, which implements data parallelism with a single process and multiple threads.
Roughly speaking, DataParallel follows a parameter-server design: the thread holding the parameter server receives the gradients and parameters returned by the other threads, aggregates them to update the parameters, and then sends the updated parameters back to the other threads. This is a one-to-many, bidirectional transfer. Because Python is constrained by the GIL, this approach is not very efficient; in practice 4 GPUs may only yield a 2-3x speedup.
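For reference, using DataParallel takes only one extra line. The sketch below assumes `model`, `trainloader`, `loss`, and `optimizer` are already defined, just as in the examples later in this post:

```python
import torch.nn as nn

# Replicate the model across all visible GPUs; each forward pass splits the
# batch between them and gathers the outputs back on the default GPU.
model = nn.DataParallel(model.cuda())

for imgs, labels in trainloader:
    imgs, labels = imgs.cuda(), labels.cuda()
    optimizer.zero_grad()
    output = model(imgs)
    loss_data = loss(output, labels)
    loss_data.backward()
    optimizer.step()
```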
2. DistributedDataParallel
PyTorch now provides a more efficient implementation, DistributedDataParallel. Compared with DataParallel it adds the notion of "distributed": DistributedDataParallel also supports multi-machine, multi-GPU training. However, since most users do not have a multi-machine environment, this post focuses on single-machine, multi-GPU usage.
In terms of how it works, DistributedDataParallel uses multiple processes, which avoids the inefficiency of Python multithreading. In general, each GPU runs in its own process, and each process computes gradients independently.
At the same time, DistributedDataParallel abandons the one-to-many transfer and synchronization of the parameter-server design in favor of ring-based gradient exchange (the illustration in the Zhihu article listed in the references shows this nicely). With this ring synchronization, each GPU only needs to exchange gradients with its upstream and downstream neighbors, which avoids the congestion that a one-to-many parameter server can suffer.

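To make the idea concrete, the sketch below averages gradients across processes with the plain torch.distributed.all_reduce collective. This is only an illustration of the effect DDP achieves; it is not DDP's actual implementation, which buckets the gradients and overlaps communication with the backward pass.

```python
import torch
import torch.distributed as dist


def average_gradients(model):
    """Average every gradient across all processes after backward().

    DDP performs the equivalent automatically and more efficiently.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient over all processes, then divide by the process count.
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```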
3. DistributedDataParallel Example
Below is a very stripped-down single-machine, multi-GPU example, broken into six steps.
Step 1: import the relevant packages.
```python
import argparse

import torch  # needed below for torch.cuda and torch.utils.data
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
```
Step 2: add a command-line argument, local_rank. Intuitively, it tells the current process which GPU it should run on; that is what the torch.cuda.set_device line below does. local_rank is passed in by a PyTorch launch script, which is explained later. The last line chooses the communication backend; nccl is the right choice here. (A quick per-process sanity check is shown right after the code.)
```python
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", default=-1, type=int)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)

dist.init_process_group(backend='nccl')
```
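To confirm that each process has bound to the GPU you expect, a small check like the one below can go right after init_process_group. It is just a sanity check, not part of the training logic, and each process prints its own line:

```python
# Each process reports its rank and the CUDA device it was bound to.
print(f"rank {dist.get_rank()} of {dist.get_world_size()} "
      f"is using GPU {torch.cuda.current_device()}")
```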
Step 3: wrap the DataLoader. What is needed here is to switch the sampler to DistributedSampler and pass it to the DataLoader as its sampler.
Why is this necessary? Every GPU, that is, every process, fetches data from the DataLoader, and specifying DistributedSampler ensures that the GPUs get non-overlapping data.
Readers may wonder: batch_size is set to 24 below, so does each GPU get 24 samples, or do all GPUs split those 24 samples between them? The answer is that each GPU gets 24 samples per iteration; with 4 GPUs, one iteration processes 24*4=96 samples in total. (See the note after the code for the effective batch size and the sampler's set_epoch call.)
```python
train_sampler = torch.utils.data.distributed.DistributedSampler(my_trainset)

trainloader = torch.utils.data.DataLoader(my_trainset, batch_size=24, num_workers=4, sampler=train_sampler)
```
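Two small follow-up notes, shown as a sketch (num_epochs is a placeholder, not part of the original example): the effective global batch size is the per-GPU batch size times the number of processes, and DistributedSampler should have set_epoch() called at the start of every epoch so that the shuffle order changes between epochs.

```python
# Global batch size = per-GPU batch size * number of processes.
effective_batch_size = 24 * dist.get_world_size()

for epoch in range(num_epochs):
    # Re-seed the sampler so each epoch uses a different shuffle order.
    train_sampler.set_epoch(epoch)
    for imgs, labels in trainloader:
        ...  # training step, as in step 5 below
```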
Step 4: wrap the model with DDP. device_ids is again args.local_rank; the model must already live on that GPU before it is wrapped.
```python
model = model.to(args.local_rank)                 # move the model to this process's GPU first
model = DDP(model, device_ids=[args.local_rank])
```
Step 5: move the input data to the specified GPU. Forward and backward propagation are the same as before.
```python
for imgs, labels in trainloader:
    imgs = imgs.to(args.local_rank)
    labels = labels.to(args.local_rank)

    optimizer.zero_grad()
    output = model(imgs)        # the DDP-wrapped model from step 4
    loss_data = loss(output, labels)
    loss_data.backward()
    optimizer.step()
```
Step 6: launch the training. torch.distributed.launch is the launch script, and nproc_per_node is the number of GPUs.
```bash
python -m torch.distributed.launch --nproc_per_node 2 main.py
```
With these six steps, the model runs on a single machine with multiple GPUs. Not that painful, is it? It is a bit more involved than DataParallel, but given the speedup it is well worth trying.
4. DistributedDataParallel caveats
Because DistributedDataParallel runs in multiple processes, some operations need care. If your code contains a print and you train with 4 GPUs, you will see that line printed four times on the console. If you only want to see it once, what do you do?
Add a check like the one below. get_rank() returns the identifier of the current process, so the print only executes in process 0.
```python
if dist.get_rank() == 0:
    print("hah")
```
You will need dist.get_rank() often, because many operations should run in only one process. Saving the model is one example: without the check above, all four processes would write the checkpoint and the file could be corrupted. Similarly, add the check when loading pretrained weights so they are loaded only once, when printing the loss, and in similar situations. (A minimal save sketch follows.)
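As a concrete example of the save-only-once pattern, a minimal sketch might look like this; the file name is just a placeholder, and .module unwraps the DDP container to get back the original network:

```python
# Only rank 0 writes the checkpoint, so the processes never write concurrently.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "checkpoint.pth")

# Optional: wait until rank 0 has finished writing before anyone reads the file.
dist.barrier()
```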
[Reference] https://zhuanlan.zhihu.com/p/178402798