Implementing DDP distributed training in PyTorch with multiple GPUs (a brief introduction, from scratch)
2022-07-29 06:57:00 【Wait for Godot.】
Preface
I spent the last two days trying to train a model on multiple GPUs in PyTorch, and this note summarizes a procedure that can be reproduced completely from scratch. It took one evening plus one noon before it finally worked. I am recording it here for future reference.
Relevant concepts
Many thanks to: 「Freshman Handbook」: PyTorch Distributed Training
- group: the process group. In most cases all DDP processes belong to the same process group.
- world_size: the total number of processes. Since it is generally best for each process to use one GPU, it can be understood as the number of GPUs.
- rank: the index of the current process, used for inter-process communication. The host of the process with rank = 0 is the master node.
- local_rank: the index of the GPU used by the current process on its own machine.
Corresponding examples
- One machine, 8 GPUs. Here world_size = 8, i.e. 8 processes; rank runs from 0 to 7 and local_rank also runs from 0 to 7. (When several jobs share one machine, use CUDA_VISIBLE_DEVICES to control which GPU devices each program can see.)
- Two machines, 16 GPUs. Each machine has 8 GPUs, 16 in total, so world_size = 16, i.e. 16 processes with rank 0 to 15. On each machine, however, local_rank is still 0 to 7. This is the difference between local_rank and rank: local_rank corresponds to the actual GPU ID on the machine.
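The relationship between these quantities can be checked with a few lines of arithmetic. The sketch below assumes every machine runs the same number of processes, one per GPU, as happens when torchrun is started with a fixed --nproc_per_node on each machine. The helper name global_rank is made up for this illustration.

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a process, assuming every node runs nproc_per_node processes."""
    return node_rank * nproc_per_node + local_rank


# Two machines with 8 GPUs each: world_size is 16.
nnodes, nproc_per_node = 2, 8
world_size = nnodes * nproc_per_node

for node_rank in range(nnodes):
    ranks = [global_rank(node_rank, nproc_per_node, lr) for lr in range(nproc_per_node)]
    print(f"node {node_rank}: local_rank 0-{nproc_per_node - 1} -> rank {ranks[0]}-{ranks[-1]}")
```

On the second machine local_rank starts again at 0 while rank continues from 8, which is exactly the distinction described above.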
Single-machine multi-GPU implementation details
Preparation
Before implementing multi-GPU training, make sure your train.py runs end to end. It should include the following modules:
- dataset module
- model module
- loss module
- optimizer module
- logging module
- model saving module
- model loading module
Import the DDP-related packages:
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.nn.parallel import DistributedDataParallel as DDP
Coding workflow:
Basic usage of DDP (the coding process)
- Use torch.distributed.init_process_group to initialize the process group
- Use torch.nn.parallel.DistributedDataParallel to create the distributed model
- Use torch.utils.data.distributed.DistributedSampler to create the DataLoader
- Adjust the other necessary places (move tensors to the designated device, save/load checkpoints, metric computation, etc.)
- Use torchrun to start training
1. Initialize process group
Define the following function:
def init_distributed_mode(args):
    # Set up the distributed device; torchrun exports RANK and LOCAL_RANK
    args.rank = int(os.environ["RANK"])
    args.local_rank = int(os.environ["LOCAL_RANK"])
    # Bind this process to its own GPU on this machine
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")
    args.device = torch.device("cuda", args.local_rank)
    print(args.device, 'argsdevice')
    args.NUM_gpu = dist.get_world_size()
    print(f"[init] == local rank: {args.local_rank}, global rank: {args.rank} ==")
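The function above reads RANK and LOCAL_RANK straight from the environment, so it raises a KeyError when the script is started with plain python instead of torchrun. As a minimal sketch, a tolerant reader could fall back to single-process values. The helper name read_dist_env and the fallback defaults are assumptions for illustration; RANK, LOCAL_RANK and WORLD_SIZE are the variables torchrun exports to each worker.

```python
import os


def read_dist_env(environ=os.environ):
    """Return (rank, local_rank, world_size), falling back to a single process."""
    rank = int(environ.get("RANK", 0))
    local_rank = int(environ.get("LOCAL_RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


# Values a worker on the second of two 8-GPU machines might see (made up for illustration).
print(read_dist_env({"RANK": "9", "LOCAL_RANK": "1", "WORLD_SIZE": "16"}))
# Plain `python train.py`: none of the variables are set.
print(read_dist_env({}))
```

Passing the mapping as a parameter keeps the helper easy to try out without actually launching several processes.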
In the main script train.py, initialize the process group.
Note: the learning rate is also scaled with the number of GPUs.
# Initialize multi-GPU
if args.multi_gpu:
    init_distributed_mode(args)
else:
    # Use a single GPU
    os.environ['CUDA_VISIBLE_DEVICES'] = args.device_gpu
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Using {device} device')
    args.device = device
    args.NUM_gpu = 1  # needed by the learning-rate scaling below
# The learning rate is scaled automatically
# (multiplied by the number of GPUs and by the batch size divided by 32).
args.lr = args.lr * args.NUM_gpu * (args.batch_size / 32)
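To make the scaling rule concrete, here is the same formula as a small standalone function, evaluated with the default learning rate from the appendix (2.6e-3). The function name scaled_lr and the parameter name ref_batch are made up for this sketch; the reference batch size of 32 comes from the code comment above.

```python
def scaled_lr(base_lr: float, num_gpus: int, batch_size: int, ref_batch: int = 32) -> float:
    """Linear scaling: base_lr * number of GPUs * (per-GPU batch size / reference batch)."""
    return base_lr * num_gpus * (batch_size / ref_batch)


# One GPU with the reference batch size leaves the learning rate unchanged.
print(scaled_lr(2.6e-3, num_gpus=1, batch_size=32))
# Two GPUs with 64 samples each: the learning rate is multiplied by 2 * 2 = 4.
print(scaled_lr(2.6e-3, num_gpus=2, batch_size=64))
```

Whether linear scaling is appropriate depends on the model and optimizer, so treat the factor as a starting point rather than a rule.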
2. Create a distributed model
After the model has been built, wrap it as a distributed model:
model = model.to(args.device)
if args.multi_gpu:
    # DistributedDataParallel: each process wraps its own copy of the model
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)
3. Create the DataLoader (can be done before or after step 2)
train_dataset = COCODetection(root=args.data.DATASET_PATH, image_set='train2017',
                              transform=SSDTransformer(dboxes))
val_dataset = COCODetection(root=args.data.DATASET_PATH, image_set='val2017',
                            transform=SSDTransformer(dboxes, val=True))
if args.multi_gpu:
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, shuffle=True)
    val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
    # The sampler does the shuffling, so the DataLoader itself must not shuffle
    train_shuffle = False
else:
    train_sampler = None
    val_sampler = None
    train_shuffle = True
train_loader = torch.utils.data.DataLoader(train_dataset, args.batch_size,
                                           num_workers=args.num_workers,
                                           shuffle=train_shuffle,
                                           sampler=train_sampler,
                                           pin_memory=True)
val_loader = torch.utils.data.DataLoader(val_dataset,
                                         batch_size=args.batch_size,
                                         shuffle=False,  # Note: distributed sampler is shuffled :(
                                         sampler=val_sampler,
                                         num_workers=args.num_workers)
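To see why each process needs its own sampler, and why the shuffle should change between epochs, here is a simplified pure-Python model of what a distributed sampler does. It is an illustration only, not PyTorch's implementation: the function shard_indices and its use of random.Random are assumptions made for this sketch. The idea it mimics is that every process shuffles with the same seed (so all agree on the order), pads the index list to a multiple of world_size, and then takes every world_size-th index starting at its own rank.

```python
import random


def shard_indices(n, world_size, rank, epoch=0, seed=0, shuffle=True):
    """Indices of an n-item dataset seen by `rank` in `epoch` (assumes n >= world_size)."""
    indices = list(range(n))
    if shuffle:
        # Same seed on every process, so all ranks agree on the permutation.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so every rank gets the same number of samples.
    per_rank = -(-n // world_size)  # ceiling division
    indices += indices[: per_rank * world_size - n]
    # Rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank::world_size]


shards = [shard_indices(10, world_size=2, rank=r, epoch=0) for r in range(2)]
print(shards)  # two disjoint halves that together cover all 10 indices
```

With the real torch.utils.data.distributed.DistributedSampler, the epoch is supplied by calling train_sampler.set_epoch(epoch) at the start of every epoch, before iterating over the DataLoader. Without that call the same ordering is reused in every epoch.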
4. Some precautions (should not be neglected)
When saving the model or writing log files, always check first that the code is running in the main process, i.e. args.local_rank == 0; otherwise everything is logged or saved once per process. (With several machines, every machine has a process with local_rank 0, so check the global rank, args.rank == 0, instead.)
if args.local_rank == 0:
    log.logger.info(f"epoch: {epoch}, acc: {acc}")
# Save model
if args.save and args.local_rank == 0:
    print("saving model...")
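The rank check tends to be repeated in many places, so it can help to wrap it in a small helper. The sketch below is one possible shape, assuming the script is launched by torchrun (which sets RANK) and treating a missing RANK as a single-process run. The names is_main_process and log_main are hypothetical.

```python
import os


def is_main_process(environ=os.environ) -> bool:
    """True for the process with global rank 0, or when RANK is not set at all."""
    return int(environ.get("RANK", 0)) == 0


def log_main(message: str, environ=os.environ) -> bool:
    """Print only from the main process; returns whether anything was printed."""
    if is_main_process(environ):
        print(message)
        return True
    return False


log_main("epoch 1 finished", {"RANK": "0"})  # printed
log_main("epoch 1 finished", {"RANK": "3"})  # silently skipped
```

The same guard applies to checkpoint saving. A DDP-wrapped model exposes the original network as model.module, so saving model.module.state_dict() from the main process gives a checkpoint whose keys do not carry the "module." prefix.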
5. Running from a shell file
For a single machine with multiple GPUs:
Create a new file multi_gpu.sh:
# example: 1 node, 2 GPUs per node (2 GPUs)
CUDA_VISIBLE_DEVICES=3,4 torchrun \
--nproc_per_node=2 \
--nnodes=1 \
--node_rank=0 \
--master_addr=localhost \
--master_port=22222 \
train.py --multi_gpu=True
A brief explanation of the parameters:
--nproc_per_node: the number of processes on each node. Each machine here uses 2 GPUs, so it is 2.
--nnodes: the number of nodes. There is only one machine here, so it is 1.
--node_rank: the rank of the node. The first machine is 0, the second machine is 1; with only one machine here it is 0.
--master_addr: the IP of the master node. Here it is the local address of the first machine, localhost. With multiple machines you need to fill in that machine's LAN IP; I did not have the hardware to try the multi-machine case.
--master_port: the port number of the master node. Any unused port will do.
Follow-up work
The sample code will be synced to my template repository, Templete; have a look if you are interested.
Notes on the batch_size setting
Because DistributedDataParallel starts a separate process for each GPU, the batch size you set is actually the batch size on a single GPU. For example, with 2 servers, 8 GPUs per server, and batch size set to 32, the effective batch size is 32 × 8 × 2 = 512. So the effective batch size is not the batch size you set.
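The arithmetic is simple enough to keep as a one-line helper. The function name effective_batch_size is made up for this sketch, and it assumes one process per GPU with the same per-GPU batch size everywhere.

```python
def effective_batch_size(per_gpu_batch: int, gpus_per_node: int, nnodes: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    return per_gpu_batch * gpus_per_node * nnodes


# The example from the text: 2 servers, 8 GPUs each, batch size 32 per GPU.
print(effective_batch_size(32, gpus_per_node=8, nnodes=2))  # 512
# The single-machine launch shown earlier: 2 GPUs, default batch size 64 per GPU.
print(effective_batch_size(64, gpus_per_node=2, nnodes=1))  # 128
```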
References and thanks
「Freshman Handbook」: PyTorch Distributed Training
PyTorch multi-GPU parallel training
Appendix:
argparse reference:
Many of these arguments are unused; pick out the ones that are useful to you. Everything is copied here as is.
import argparse
parser = argparse.ArgumentParser(description='Train Single Shot MultiBox Detector on COCO')
parser.add_argument('--model_name', default='SSD300', type=str,
help='The model name')
parser.add_argument('--model_config', default='configs/SSD300.yaml',
metavar='FILE', help='path to model cfg file', type=str,)
parser.add_argument('--data_config', default='data/coco.yaml',
metavar='FILE', help='path to data cfg file', type=str,)
parser.add_argument('--device_gpu', default='3,4', type=str,
help='Cuda device, i.e. 0 or 0,1,2,3')
parser.add_argument('--checkpoint', default=None, help='The checkpoint path')
parser.add_argument('--save', type=str, default='checkpoints',
help='save model checkpoints in the specified directory')
parser.add_argument('--mode', type=str, default='training',
choices=['training', 'evaluation', 'benchmark-training', 'benchmark-inference'])
parser.add_argument('--epochs', '-e', type=int, default=65,
help='number of epochs for training')
parser.add_argument('--evaluation', nargs='*', type=int, default=[21, 31, 37, 42, 48, 53, 59, 64],
help='epochs at which to evaluate')
parser.add_argument('--multistep', nargs='*', type=int, default=[43, 54],
help='epochs at which to decay learning rate')
parser.add_argument('--warmup', type=int, default=None)
parser.add_argument('--seed', '-s', default = 42 , type=int, help='manually set random seed for torch')
# Hyperparameters
parser.add_argument('--lr', type=float, default=2.6e-3,
help='learning rate for SGD optimizer')
parser.add_argument('--momentum', '-m', type=float, default=0.9,
help='momentum argument for SGD optimizer')
parser.add_argument('--weight_decay', '--wd', type=float, default=0.0005,
help='weight-decay for SGD optimizer')
parser.add_argument('--batch_size', '--bs', type=int, default=64,
help='number of examples for each iteration')
parser.add_argument('--num_workers', type=int, default=8)
parser.add_argument('--backbone', type=str, default='resnet50',
choices=['resnet18', 'resnet34', 'resnet50', 'resnet101', 'resnet152'])
parser.add_argument('--backbone-path', type=str, default=None,
help='Path to checkpointed backbone. It should match the'
' backbone model declared with the --backbone argument.'
' When it is not provided, pretrained model from torchvision'
' will be downloaded.')
parser.add_argument('--report-period', type=int, default=100, help='Report the loss every X times.')
# parser.add_argument('--save-period', type=int, default=-1, help='Save checkpoint every x epochs (disabled if < 1)')
# Multi Gpu
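# type=bool would treat any non-empty string (even "False") as True, so parse the text explicitly
parser.add_argument('--multi_gpu', default=False, type=lambda s: s.lower() in ('true', '1', 'yes'),
help='Whether to use multi gpu to train the model, if use multi gpu, please use by sh.')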
#others
parser.add_argument('--amp', action='store_true', default = False,
help='Whether to enable AMP ops. When false, uses TF32 on A100 and FP32 on V100 GPUS.')
args = parser.parse_args()
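Boolean command-line options such as --multi_gpu deserve some care. argparse calls the type function on the raw string, and bool() of any non-empty string is True, so a flag declared with type=bool cannot be switched off by passing False. The sketch below demonstrates the effect and one possible workaround. The helper str2bool is hypothetical and not part of argparse.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common spellings of true/false; hypothetical helper for this sketch."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


naive = argparse.ArgumentParser()
naive.add_argument("--multi_gpu", default=False, type=bool)
print(naive.parse_args(["--multi_gpu=False"]).multi_gpu)  # True: bool("False") is True

strict = argparse.ArgumentParser()
strict.add_argument("--multi_gpu", default=False, type=str2bool)
print(strict.parse_args(["--multi_gpu=False"]).multi_gpu)  # False
```

An alternative is action='store_true', but then the flag takes no value and the launch command would pass --multi_gpu instead of --multi_gpu=True.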