PyTorch distributed training
2022-07-27 21:53:00 【Luo Lei】
The era of training models on a single machine with a single GPU is over; a single machine with multiple GPUs is now the mainstream setup. How do you get the most out of multiple GPUs? This article introduces PyTorch's DistributedDataParallel.
1. DataParallel
PyTorch has long shipped a tool for data parallelism, DataParallel, which implements data parallelism with a single process and multiple threads.
Roughly speaking, DataParallel follows a parameter-server design: the thread acting as the parameter server receives the gradients and parameters sent back by the other threads, aggregates them, updates the parameters, and then broadcasts the updated parameters back to the other threads. This is a one-to-many, bidirectional transfer. Because of Python's GIL, this approach is not very efficient; with 4 GPUs, for example, you may only see a 2-3x speedup.
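As a quick illustration (not part of the original post), typical nn.DataParallel usage looks roughly like this; the model and batch below are placeholders:

```python
import torch
import torch.nn as nn

net = nn.Linear(128, 10)                 # placeholder model for illustration
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)           # single process; one worker thread per GPU
net = net.cuda()

x = torch.randn(24, 128).cuda()          # the batch is split across GPUs in the forward pass
out = net(x)                             # outputs are gathered back on the default GPU
```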
2. DistributedDataParallel
PyTorch now provides a more efficient implementation, DistributedDataParallel. Compared with DataParallel, it adds the notion of "distributed": DistributedDataParallel can also handle multi-machine, multi-GPU training. Since most users do not have a multi-machine environment, however, this post focuses on single-machine, multi-GPU usage.
In terms of design, DistributedDataParallel uses multiple processes, which avoids the inefficiency of Python multithreading. In general, each GPU runs in its own process, and each process computes gradients independently.
DistributedDataParallel also abandons the parameter server's one-to-many transfer and synchronization in favor of a ring-based gradient exchange (the diagram from the Zhihu post in the reference illustrates it). With this ring synchronization, each GPU only needs to exchange gradients with its upstream and downstream neighbors in the ring, which avoids the communication bottleneck that a one-to-many parameter server can become.

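To make the ring idea concrete, here is a toy, pure-Python simulation of ring all-reduce (reduce-scatter followed by all-gather). It only illustrates the communication pattern; it is not how NCCL or DDP actually implement gradient synchronization, and all names are made up for the sketch.

```python
def ring_allreduce(worker_grads):
    """worker_grads[r] is worker r's local gradient, pre-split into
    n chunks (one chunk per worker); each chunk is a list of floats."""
    n = len(worker_grads)

    # Phase 1: reduce-scatter. After n - 1 steps, worker r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing messages so every worker sends its state as it
        # was at the start of the step (all sends happen "simultaneously").
        msgs = []
        for r in range(n):
            c = (r - step) % n                      # chunk worker r forwards this step
            msgs.append((c, list(worker_grads[r][c])))
        for r in range(n):
            c, data = msgs[r]
            dst = (r + 1) % n                       # next worker around the ring
            worker_grads[dst][c] = [a + b for a, b in zip(worker_grads[dst][c], data)]

    # Phase 2: all-gather. Each worker passes its completed chunk around the ring.
    for step in range(n - 1):
        msgs = []
        for r in range(n):
            c = (r + 1 - step) % n                  # chunk worker r already has in full
            msgs.append((c, list(worker_grads[r][c])))
        for r in range(n):
            c, data = msgs[r]
            worker_grads[(r + 1) % n][c] = data     # overwrite: this chunk is already complete
    return worker_grads


# 3 workers, gradient split into 3 chunks of one element each.
grads = [[[1.0], [2.0], [3.0]],
         [[10.0], [20.0], [30.0]],
         [[100.0], [200.0], [300.0]]]
print(ring_allreduce(grads))   # every worker ends up with [[111.0], [222.0], [333.0]]
```

Note how each worker only ever sends to its ring neighbor, which is exactly the property the paragraph above describes.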
3. DistributedDataParallel Example
Below is a deliberately simplified single-machine, multi-GPU example, broken into six steps.
Step 1: import the relevant packages.
```python
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
```
Step 2: add a local_rank argument. Intuitively, it tells the current process which GPU to run on, which is what the torch.cuda.set_device call below does. local_rank is passed in by a PyTorch launch script, explained in step 6. The last line specifies the communication backend; nccl is the right choice for GPUs.
```python
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", default=-1, type=int)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)

dist.init_process_group(backend='nccl')
```
Step 3: wrap the DataLoader. The change needed here is to use a DistributedSampler as the sampler passed to the DataLoader.
Why is this necessary? Every GPU, i.e. every process, pulls data from its own DataLoader; specifying a DistributedSampler ensures the GPUs receive non-overlapping subsets of the data.
You might wonder: with batch_size=24 below, does each GPU get 24 samples, or do all GPUs split those 24 samples between them? The answer is that each GPU gets 24 samples per iteration, so with 4 GPUs one iteration processes 24 * 4 = 96 samples in total.
```python
train_sampler = torch.utils.data.distributed.DistributedSampler(my_trainset)

trainloader = torch.utils.data.DataLoader(my_trainset, batch_size=24, num_workers=4, sampler=train_sampler)
```
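One detail worth adding (based on the standard DistributedSampler API, not shown in the original snippet): when shuffling is enabled, call set_epoch at the start of every epoch, otherwise every epoch reuses the same shuffling order.

```python
for epoch in range(num_epochs):          # num_epochs is assumed to be defined elsewhere
    train_sampler.set_epoch(epoch)       # reseed the sampler so each epoch shuffles differently
    for imgs, labels in trainloader:
        ...                              # training step as in step 5 below
```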
Step 4: wrap the model with DDP. device_ids is again args.local_rank.

```python
model = model.to(args.local_rank)                  # move the model to this process's GPU first
model = DDP(model, device_ids=[args.local_rank])
```
Step 5: move the input data onto the specified GPU. The forward and backward passes are the same as before.

```python
for imgs, labels in trainloader:
    # each process moves its own mini-batch to its own GPU
    imgs = imgs.to(args.local_rank)
    labels = labels.to(args.local_rank)

    optimizer.zero_grad()
    output = model(imgs)
    loss_data = loss(output, labels)
    loss_data.backward()        # DDP synchronizes gradients across GPUs here
    optimizer.step()
```
Step 6: launch training. torch.distributed.launch is the launch script mentioned in step 2 (it is what passes --local_rank to each process), and nproc_per_node is the number of GPUs.

```bash
python -m torch.distributed.launch --nproc_per_node 2 main.py
```
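As a side note, newer PyTorch releases recommend torchrun as the replacement for torch.distributed.launch. A rough equivalent is shown below, with the caveat that torchrun passes the local rank via the LOCAL_RANK environment variable rather than as a --local_rank argument.

```bash
torchrun --nproc_per_node 2 main.py
```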
With these six steps, the model runs on a single machine with multiple GPUs. Not that troublesome, is it? It is a bit more involved than DataParallel, but given the speedup, it is worth trying.
4. DistributedDataParallel caveats
DistributedDataParallel runs in multiple processes, so some operations need extra care. If you put a print statement in the code and train with 4 GPUs, you will see four copies of the output on the console. What if we only want to see one?
Add a check like the following. dist.get_rank() returns the identifier of the current process, so the print only executes in process 0.
```python
if dist.get_rank() == 0:
    print("hah")
```
You will need dist.get_rank() often, because many operations should run in only one process. For example, when saving the model: without the check above, all four processes would write the checkpoint, which can corrupt the file. Similarly, when loading pretrained weights, add the check so the load happens only once; the same applies to things like logging the loss.
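As a sketch of the checkpointing case described above (the filename is just an example), saving only from rank 0 looks like this:

```python
if dist.get_rank() == 0:
    # model.module unwraps the DDP wrapper so the checkpoint loads into a plain model
    torch.save(model.module.state_dict(), "checkpoint.pt")
```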
[Reference] https://zhuanlan.zhihu.com/p/178402798