PyTorch distributed training
2022-07-27 21:53:00 【Luo Lei】
The era of training models on a single machine with a single GPU is over; a single machine with multiple GPUs is now the mainstream setup. How do you get the most out of multiple GPUs? This article introduces PyTorch's DistributedDataParallel.
1. DataParallel
PyTorch has actually long shipped a data-parallel tool, DataParallel, which implements data parallelism with a single process and multiple threads.
Roughly speaking, DataParallel follows a parameter-server design: the thread holding the parameter server receives the gradients and parameters returned by the other threads, aggregates them to update the parameters, and then sends the updated parameters back to the other threads. This is a one-to-many, bidirectional transfer. Because Python is constrained by the GIL, this approach is not very efficient; in practice 4 GPUs may only yield a 2-3x speedup.
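For reference, using DataParallel takes only one extra line. The sketch below assumes `model`, `trainloader`, `loss`, and `optimizer` are already defined, just as in the examples later in this post:

```python
import torch.nn as nn

# Replicate the model across all visible GPUs; each forward pass splits the
# batch between them and gathers the outputs back on the default GPU.
model = nn.DataParallel(model.cuda())

for imgs, labels in trainloader:
    imgs, labels = imgs.cuda(), labels.cuda()
    optimizer.zero_grad()
    output = model(imgs)
    loss_data = loss(output, labels)
    loss_data.backward()
    optimizer.step()
```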
2. DistributedDataParallel
PyTorch now provides a more efficient implementation, DistributedDataParallel. Compared with DataParallel it adds the notion of "distributed": DistributedDataParallel also supports multi-machine, multi-GPU training. However, since most users do not have a multi-machine environment, this post focuses on single-machine, multi-GPU usage.
In terms of how it works, DistributedDataParallel uses multiple processes, which avoids the inefficiency of Python multithreading. In general, each GPU runs in its own process, and each process computes gradients independently.
At the same time, DistributedDataParallel abandons the one-to-many transfer and synchronization of the parameter-server design in favor of ring-based gradient exchange (the illustration in the Zhihu article listed in the references shows this nicely). With this ring synchronization, each GPU only needs to exchange gradients with its upstream and downstream neighbors, which avoids the congestion that a one-to-many parameter server can suffer.

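To make the idea concrete, the sketch below averages gradients across processes with the plain torch.distributed.all_reduce collective. This is only an illustration of the effect DDP achieves; it is not DDP's actual implementation, which buckets the gradients and overlaps communication with the backward pass.

```python
import torch
import torch.distributed as dist


def average_gradients(model):
    """Average every gradient across all processes after backward().

    DDP performs the equivalent automatically and more efficiently.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient over all processes, then divide by the process count.
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size
```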
3. DistributedDataParallel Example
Below is a very stripped-down single-machine, multi-GPU example, broken into six steps.
Step 1: import the relevant packages.
```python
import argparse

import torch  # needed below for torch.cuda and torch.utils.data
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
```
Step 2: add a command-line argument, local_rank. Intuitively, it tells the current process which GPU it should run on; that is what the torch.cuda.set_device line below does. local_rank is passed in by a PyTorch launch script, which is explained later. The last line chooses the communication backend; nccl is the right choice here. (A quick per-process sanity check is shown right after the code.)
```python
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", default=-1, type=int)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)

dist.init_process_group(backend='nccl')
```
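To confirm that each process has bound to the GPU you expect, a small check like the one below can go right after init_process_group. It is just a sanity check, not part of the training logic, and each process prints its own line:

```python
# Each process reports its rank and the CUDA device it was bound to.
print(f"rank {dist.get_rank()} of {dist.get_world_size()} "
      f"is using GPU {torch.cuda.current_device()}")
```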
Step 3: wrap the DataLoader. What is needed here is to switch the sampler to DistributedSampler and pass it to the DataLoader as its sampler.
Why is this necessary? Every GPU, that is, every process, fetches data from the DataLoader, and specifying DistributedSampler ensures that the GPUs get non-overlapping data.
Readers may wonder: batch_size is set to 24 below, so does each GPU get 24 samples, or do all GPUs split those 24 samples between them? The answer is that each GPU gets 24 samples per iteration; with 4 GPUs, one iteration processes 24*4=96 samples in total. (See the note after the code for the effective batch size and the sampler's set_epoch call.)
```python
train_sampler = torch.utils.data.distributed.DistributedSampler(my_trainset)

trainloader = torch.utils.data.DataLoader(my_trainset, batch_size=24, num_workers=4, sampler=train_sampler)
```
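Two small follow-up notes, shown as a sketch (num_epochs is a placeholder, not part of the original example): the effective global batch size is the per-GPU batch size times the number of processes, and DistributedSampler should have set_epoch() called at the start of every epoch so that the shuffle order changes between epochs.

```python
# Global batch size = per-GPU batch size * number of processes.
effective_batch_size = 24 * dist.get_world_size()

for epoch in range(num_epochs):
    # Re-seed the sampler so each epoch uses a different shuffle order.
    train_sampler.set_epoch(epoch)
    for imgs, labels in trainloader:
        ...  # training step, as in step 5 below
```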
Step 4: wrap the model with DDP. device_ids is again args.local_rank; the model must already live on that GPU before it is wrapped.
```python
model = model.to(args.local_rank)                 # move the model to this process's GPU first
model = DDP(model, device_ids=[args.local_rank])
```
Step 5: move the input data to the specified GPU. Forward and backward propagation are the same as before.
```python
for imgs, labels in trainloader:
    imgs = imgs.to(args.local_rank)
    labels = labels.to(args.local_rank)

    optimizer.zero_grad()
    output = model(imgs)        # the DDP-wrapped model from step 4
    loss_data = loss(output, labels)
    loss_data.backward()
    optimizer.step()
```
Step 6: launch the training. torch.distributed.launch is the launch script, and nproc_per_node is the number of GPUs.
```bash
python -m torch.distributed.launch --nproc_per_node 2 main.py
```
With these six steps, the model runs on a single machine with multiple GPUs. Not that painful, is it? It is a bit more involved than DataParallel, but given the speedup it is well worth trying.
4. DistributedDataParallel caveats
Because DistributedDataParallel runs in multiple processes, some operations need care. If your code contains a print and you train with 4 GPUs, you will see that line printed four times on the console. If you only want to see it once, what do you do?
Add a check like the one below. get_rank() returns the identifier of the current process, so the print only executes in process 0.
```python
if dist.get_rank() == 0:
    print("hah")
```
You will need dist.get_rank() often, because many operations should run in only one process. Saving the model is one example: without the check above, all four processes would write the checkpoint and the file could be corrupted. Similarly, add the check when loading pretrained weights so they are loaded only once, when printing the loss, and in similar situations. (A minimal save sketch follows.)
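As a concrete example of the save-only-once pattern, a minimal sketch might look like this; the file name is just a placeholder, and .module unwraps the DDP container to get back the original network:

```python
# Only rank 0 writes the checkpoint, so the processes never write concurrently.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "checkpoint.pth")

# Optional: wait until rank 0 has finished writing before anyone reads the file.
dist.barrier()
```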
[Reference] https://zhuanlan.zhihu.com/p/178402798