Implementing DDP cluster distributed training with PyTorch on multiple GPUs (a brief introduction, from scratch)

2022-07-29 06:57:00 【Wait for Godot.】

Contents

Preface
- Relevant concepts
Single-machine multi-GPU implementation details
Follow-up work
- Notes on setting batch_size
References and thanks
Appendix:
- argparse reference:

Preface

I spent the past two days getting multi-GPU model training to work in PyTorch, and this note summarizes a complete implementation from scratch. It took one evening plus one midday before it finally succeeded. I am recording it here for future reference.

Relevant concepts

With many thanks to: 「Freshman Handbook」: PyTorch Distributed Training

group: process group. In most cases, all DDP processes belong to the same process group.
world_size: total number of processes. (In principle it is best for each process to use one GPU, so this can be understood as the number of GPUs.)
rank: index of the current process, used for inter-process communication. The host running rank = 0 is the master node.
local_rank: index of the GPU used by the current process on its own machine.

Examples

Single machine, 8 GPUs. Here world_size = 8, i.e. 8 processes, with ranks 0-7 and local_rank also 0-7. (Note: when several jobs share one machine, use CUDA_VISIBLE_DEVICES to control which GPU devices each program can see.)
Two machines, 16 GPUs. Each machine has 8 GPUs, 16 in total, so world_size = 16, i.e. 16 processes with ranks 0-15. On each machine, however, local_rank is still 0-7. This is the difference between local_rank and rank: local_rank maps to the actual GPU ID on the machine.
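
The relationship between rank and local_rank in the two examples above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every machine runs the same number of processes and that global ranks are assigned machine by machine in node order; the helper name global_rank is hypothetical.

```python
def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of a process, assuming ranks are assigned node by node."""
    return node_rank * gpus_per_node + local_rank


nodes, gpus_per_node = 2, 8
world_size = nodes * gpus_per_node  # 16 processes in total

for node_rank in range(nodes):
    ranks = [global_rank(node_rank, local, gpus_per_node) for local in range(gpus_per_node)]
    print(f"machine {node_rank}: local_rank 0-{gpus_per_node - 1} -> rank {ranks[0]}-{ranks[-1]}")
```

The second machine reuses local_rank 0-7 while its global ranks continue from 8 to 15.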

Single-machine multi-GPU implementation details

Preparation

Before implementing multi-GPU training, make sure your train.py already runs end to end. It should include the following modules:

dataset module
model module
loss module
optimizer module
log module
model saving module
model loading module

Imports needed for DDP:

import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.nn.parallel import DistributedDataParallel as DDP

Coding workflow:

Basic usage of DDP (the coding workflow)

Use torch.distributed.init_process_group to initialize the process group
Use torch.nn.parallel.DistributedDataParallel to create the distributed model
Use torch.utils.data.distributed.DistributedSampler to create the DataLoader
Adjust other necessary places (moving tensors to the designated device, saving/loading checkpoints, metric computation, etc.)
Use torchrun to start training

1. Initialize the process group

Define the following function:

def init_distributed_mode(args):
    # Set up the distributed device; torchrun exports RANK and LOCAL_RANK.
    args.rank = int(os.environ["RANK"])
    args.local_rank = int(os.environ["LOCAL_RANK"])
    # Bind this process to its own GPU (same index as args.device below).
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")
    args.device = torch.device("cuda", args.local_rank)
    print(args.device, "argsdevice")
    # Total number of processes, i.e. the number of GPUs in use.
    args.NUM_gpu = dist.get_world_size()
    print(f"[init] == local rank: {args.local_rank}, global rank: {args.rank} ==")
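
torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to every process it starts, which is why the function above can read them from os.environ. As a sketch only, the hypothetical helper below (not part of the original train.py) reads the same variables but falls back to single-process values, so a script does not raise KeyError when it is started with plain python instead of torchrun.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int
    local_rank: int
    world_size: int


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the variables set by torchrun; default to a single process."""
    return DistEnv(
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
        world_size=int(environ.get("WORLD_SIZE", 1)),
    )


# Second process on a single machine with two GPUs, as torchrun would set it up.
print(read_dist_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"}))
# No torchrun variables present: behaves like a single process.
print(read_dist_env({}))
```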

In the main function of train.py, initialize the process group.
Note: the learning rate is also scaled with the number of GPUs.

    # Initialize multi-GPU training
    if args.multi_gpu:
        init_distributed_mode(args)
    else:
        # Use a single GPU
        os.environ['CUDA_VISIBLE_DEVICES'] = args.device_gpu
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        print(f'Using {device} device')
        args.device = device
        # Single process: needed so the learning-rate scaling below also works here
        args.NUM_gpu = 1
    # The learning rate is scaled automatically
    # (multiplied by the number of GPUs and by the batch size divided by 32).
    args.lr = args.lr * args.NUM_gpu * (args.batch_size / 32)
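
The scaling rule in the last line can be pulled out into a small function to check the numbers. This is a sketch, assuming the reference batch size of 32 from the comment above; the function name scale_lr is hypothetical. The inputs are the default --lr of 2.6e-3 and --batch_size of 64 listed in the appendix.

```python
def scale_lr(base_lr: float, num_gpus: int, batch_size: int, ref_batch_size: int = 32) -> float:
    """Scale the learning rate linearly with GPU count and per-GPU batch size."""
    return base_lr * num_gpus * (batch_size / ref_batch_size)


lr_single = scale_lr(2.6e-3, num_gpus=1, batch_size=64)
lr_double = scale_lr(2.6e-3, num_gpus=2, batch_size=64)
print(f"1 GPU: {lr_single:.4g}, 2 GPUs: {lr_double:.4g}")
```

Keeping the rule in one place makes it harder to apply the scaling twice by accident, for example once in the script and once in a config file.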

2. Create a distributed model

After the model module has been loaded, create the distributed model:

model = model.to(args.device)
if args.multi_gpu:
    # Wrap the model in DistributedDataParallel and keep training with the wrapped model
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)

3. Create the DataLoader (this and step 2 can be done in either order)

    train_dataset = COCODetection(root=args.data.DATASET_PATH, image_set='train2017',
                                  transform=SSDTransformer(dboxes))

    val_dataset = COCODetection(root=args.data.DATASET_PATH, image_set='val2017',
                                transform=SSDTransformer(dboxes, val=True))

    if args.multi_gpu:
        # The sampler does the shuffling, so the DataLoader itself must not shuffle.
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, shuffle=True)
        # DistributedSampler shuffles by default; turn it off for validation.
        val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset, shuffle=False)
        train_shuffle = False
    else:
        train_sampler = None
        val_sampler = None
        train_shuffle = True

    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=args.batch_size,
                                               num_workers=args.num_workers,
                                               shuffle=train_shuffle,
                                               sampler=train_sampler,
                                               pin_memory=True)

    val_loader = torch.utils.data.DataLoader(val_dataset,
                                             batch_size=args.batch_size,
                                             shuffle=False,
                                             sampler=val_sampler,
                                             num_workers=args.num_workers)
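
To see what DistributedSampler does for each process, here is a simplified pure-Python sketch of the idea: all processes build the same permutation from a shared seed plus the epoch number, pad it so it divides evenly, and each rank takes every world_size-th index. It is an illustration, not PyTorch's implementation, and the function name shard_indices is hypothetical.

```python
import random


def shard_indices(n: int, world_size: int, rank: int, epoch: int, seed: int = 0) -> list:
    """Indices seen by one rank in one epoch (simplified DistributedSampler)."""
    indices = list(range(n))
    # Same seed + epoch on every rank, so all ranks agree on the permutation.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so every rank gets the same count.
    per_rank = -(-n // world_size)  # ceiling division
    indices += indices[: per_rank * world_size - n]
    # Each rank takes a strided slice of the shared permutation.
    return indices[rank::world_size]


shards = [shard_indices(10, world_size=4, rank=r, epoch=0) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

The epoch term is the reason train_sampler.set_epoch(epoch) is normally called at the start of every epoch in the training loop: without it the sampler keeps using epoch 0 and every epoch sees the data in the same order.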

4. Some precautions (not to be neglected)

When saving the model or writing log files, always check first that you are on the main process, i.e. args.local_rank == 0; otherwise every process will log or save the same thing.

if args.local_rank == 0:
    log.logger.info("epoch %s, acc %s", epoch, acc)
# Save model
if args.save and args.local_rank == 0:
    print("saving model...")
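
On a single machine, local_rank == 0 identifies exactly one process. On several machines it is true once per machine, so each machine would write its own copy; checking the global rank instead leaves a single writer. The sketch below only counts how many processes pass each check for the two-machine, 16-GPU example from earlier; the helper names are hypothetical, and ranks are assumed to be assigned node by node.

```python
def passes_local_check(local_rank: int) -> bool:
    return local_rank == 0


def passes_global_check(rank: int) -> bool:
    return rank == 0


nodes, gpus_per_node = 2, 8
# (rank, local_rank) for every process, assuming ranks are assigned node by node.
processes = [(node * gpus_per_node + local, local)
             for node in range(nodes) for local in range(gpus_per_node)]

local_writers = sum(passes_local_check(local) for _, local in processes)
global_writers = sum(passes_global_check(rank) for rank, _ in processes)
print(f"local_rank == 0: {local_writers} writers, rank == 0: {global_writers} writer")
```

Whether one writer per machine is acceptable depends on the setup: it is harmless when each machine has its own disk, but on a shared filesystem the copies would overwrite each other.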

5. Launching from a shell script

For a single machine with multiple GPUs,
create a new file multi_gpu.sh:

# example: 1 node, 2 GPUs per node (2 GPUs in total)

CUDA_VISIBLE_DEVICES=3,4 torchrun \
    --nproc_per_node=2 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=22222 \
    train.py --multi_gpu=True

A brief explanation of the parameters:

--nproc_per_node: the number of processes per node. Each machine here uses 2 GPUs, so it is 2.

--nnodes: the number of nodes. There is only one machine here, so it is 1.

--node_rank: the rank of the node. It is 0 for the first machine and 1 for the second; with only one machine here, it is 0.

--master_addr: the IP of the master node. Here I use localhost, the local address of the first machine. With multiple machines you need to fill in that machine's LAN IP. I did not have the hardware to try the multi-machine case.

--master_port: the port of the master node. Any port that is not in use will do.
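
The multi-machine case was not tested for this note, so the following is only a sketch of how the same flags would change from machine to machine: everything stays identical except --node_rank. The address 192.168.1.10, the helper name torchrun_cmd, and the layout of 2 machines with 8 GPUs each are assumptions for illustration.

```python
def torchrun_cmd(node_rank: int, nnodes: int, nproc_per_node: int,
                 master_addr: str, master_port: int, script: str = "train.py") -> list:
    """Argument list for launching one machine's share of the job with torchrun."""
    return [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
        "--multi_gpu=True",
    ]


# Hypothetical layout: 2 machines with 8 GPUs each; 192.168.1.10 stands in for machine 0's LAN IP.
commands = [torchrun_cmd(n, nnodes=2, nproc_per_node=8,
                         master_addr="192.168.1.10", master_port=22222)
            for n in range(2)]
for cmd in commands:
    print(" ".join(cmd))
```

Each printed line would be run on the corresponding machine, and both machines must be able to reach the master address on the chosen port.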

Follow-up work

The sample code will later be synced to my template repository, Templete. Take a look if you are interested.

Notes on setting batch_size

Because DistributedDataParallel starts a separate process on each GPU, the batch size you set is actually the batch size on a single GPU. For example, with 2 servers, 8 GPUs per server, and batch size set to 32, the effective batch size is 32 × 8 × 2 = 512. So the effective batch size is not the batch size you set.
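
The arithmetic in the example above, written out as a sketch; the function name effective_batch_size is hypothetical.

```python
def effective_batch_size(per_gpu_batch_size: int, gpus_per_node: int, nodes: int) -> int:
    """Total number of samples processed per step across all processes."""
    return per_gpu_batch_size * gpus_per_node * nodes


total = effective_batch_size(per_gpu_batch_size=32, gpus_per_node=8, nodes=2)
print(total)  # 512
```

Going the other way, to keep a fixed effective batch size when adding GPUs, divide the target by the total number of processes and pass the result as batch_size.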

References and thanks

「Freshman Handbook」: PyTorch Distributed Training

PyTorch multi-GPU parallel training

Appendix:

argparse reference:

Many of these arguments are not used in this note; pick out the ones that are useful to you. The full list is copied here.

    parser = argparse.ArgumentParser(description='Train Single Shot MultiBox Detector on COCO')
    parser.add_argument('--model_name', default='SSD300', type=str,
                        help='The model name')
    parser.add_argument('--model_config', default='configs/SSD300.yaml', 
                        metavar='FILE', help='path to model cfg file', type=str,)
    parser.add_argument('--data_config', default='data/coco.yaml', 
                        metavar='FILE', help='path to data cfg file', type=str,)
    parser.add_argument('--device_gpu', default='3,4', type=str,
                        help='Cuda device, i.e. 0 or 0,1,2,3')
    parser.add_argument('--checkpoint', default=None, help='The checkpoint path')
    parser.add_argument('--save', type=str, default='checkpoints',
                        help='save model checkpoints in the specified directory')
    parser.add_argument('--mode', type=str, default='training',
                        choices=['training', 'evaluation', 'benchmark-training', 'benchmark-inference'])
    parser.add_argument('--epochs', '-e', type=int, default=65,
                        help='number of epochs for training')
    parser.add_argument('--evaluation', nargs='*', type=int, default=[21, 31, 37, 42, 48, 53, 59, 64],
                        help='epochs at which to evaluate')
    parser.add_argument('--multistep', nargs='*', type=int, default=[43, 54],
                        help='epochs at which to decay learning rate')
    parser.add_argument('--warmup', type=int, default=None)
    parser.add_argument('--seed', '-s', default = 42 , type=int, help='manually set random seed for torch')
    
    # Hyperparameters
    parser.add_argument('--lr', type=float, default=2.6e-3,
                        help='learning rate for SGD optimizer')
    parser.add_argument('--momentum', '-m', type=float, default=0.9,
                        help='momentum argument for SGD optimizer')
    parser.add_argument('--weight_decay', '--wd', type=float, default=0.0005,
                        help='weight-decay for SGD optimizer')
    parser.add_argument('--batch_size', '--bs', type=int, default=64,
                        help='number of examples for each iteration')
    parser.add_argument('--num_workers', type=int, default=8) 
    
    parser.add_argument('--backbone', type=str, default='resnet50',
                        choices=['resnet18', 'resnet34', 'resnet50', 'resnet101', 'resnet152'])
    parser.add_argument('--backbone-path', type=str, default=None,
                        help='Path to checkpointed backbone. It should match the'
                             ' backbone model declared with the --backbone argument.'
                             ' When it is not provided, pretrained model from torchvision'
                             ' will be downloaded.')
    parser.add_argument('--report-period', type=int, default=100, help='Report the loss every X times.')
    
    # parser.add_argument('--save-period', type=int, default=-1, help='Save checkpoint every x epochs (disabled if < 1)')
    
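    # Multi GPU
    # Note: with type=bool, argparse turns any non-empty string (even "False") into True,
    # so only pass --multi_gpu when multi-GPU training is actually wanted.
    parser.add_argument('--multi_gpu', default=False, type=bool,
                        help='Whether to use multiple GPUs to train the model; if so, launch via the sh script.')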
    
    #others 
    parser.add_argument('--amp', action='store_true', default = False,
                        help='Whether to enable AMP ops. When false, uses TF32 on A100 and FP32 on V100 GPUS.')

    args = parser.parse_args()
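
One pitfall in the list above is --multi_gpu with type=bool: argparse calls bool() on the raw string, and every non-empty string is truthy, so --multi_gpu=False still enables multi-GPU mode. The sketch below compares that behavior with a small converter; str2bool is a hypothetical helper, not part of argparse, chosen here because it keeps the --multi_gpu=True form used in multi_gpu.sh working.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common spellings of true/false; reject anything else."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes", "y"):
        return True
    if lowered in ("false", "0", "no", "n"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


naive = argparse.ArgumentParser()
naive.add_argument("--multi_gpu", default=False, type=bool)

strict = argparse.ArgumentParser()
strict.add_argument("--multi_gpu", default=False, type=str2bool)

print(naive.parse_args(["--multi_gpu=False"]).multi_gpu)   # True: bool("False") is truthy
print(strict.parse_args(["--multi_gpu=False"]).multi_gpu)  # False
print(strict.parse_args(["--multi_gpu=True"]).multi_gpu)   # True
```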