
torch DDP Training

2022-06-22 04:26:00 Love CV

01

There are three main approaches to distributed training:

Option 1: Split the model itself across different GPUs (model parallelism). Only needed when the model is too large to fit on one GPU; otherwise rarely used.

Option 2: Keep one copy of the model and split the data across GPUs: torch.nn.DataParallel (DP). A minimal wrapping sketch follows this list.

  • Basically never produces bugs; close to drop-in.

  • Synchronized BatchNorm has to be handled by yourself.

Option 3: Each GPU holds its own copy of the model and its own slice of the data: torch.nn.parallel.DistributedDataParallel (DDP).

  • More error-prone: data is not shared between processes and file access order is not deterministic, so logging, dataset preprocessing, and placing the model and loss on the designated CUDA device all need careful design.

  • Sync BN is provided by PyTorch out of the box.

  • In principle the effect is the same as option 2: both train with a larger effective batch size. In practice it really is faster than option 2; the time spent moving data to CUDA seems to drop noticeably.

  • Supports multi-machine training.

  • With few GPUs and a network whose per-step runtime is short, it may actually be no better than option 2.
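As a rough illustration of option 2, here is a minimal DataParallel sketch; ToyModel is a hypothetical placeholder module, CUDA GPUs are assumed to be available, and the full DDP setup for option 3 appears in section 03.

import torch
import torch.nn as nn

class ToyModel(nn.Module):  # hypothetical placeholder model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)
    def forward(self, x):
        return self.fc(x)

# Option 2: single-process DataParallel. The input batch is scattered across
# all visible GPUs on every forward pass and the outputs are gathered back.
model = nn.DataParallel(ToyModel()).cuda()
out = model(torch.randn(32, 10).cuda())  # the batch of 32 is split across the GPUs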

02

Principle

Increasing the batch size brings the usual drawbacks of a larger batch size:

  • Worse generalization (over-fitting): alleviate it with learning-rate warm-up, and explore how far the batch size can be increased without hurting generalization.

  • The learning rate has to be scaled up accordingly: with n times the batch size, one epoch has n times fewer optimizer steps, so the learning rate is usually multiplied by n as well. A minimal sketch follows this list.
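A minimal sketch of the linear learning-rate scaling plus warm-up mentioned above; base_lr, world_size and warmup_steps are assumed example values, and LambdaLR is used only as one possible warm-up implementation.

import torch

base_lr, world_size, warmup_steps = 0.1, 4, 500   # assumed example values
model = torch.nn.Linear(10, 10)                   # placeholder model

# Linear scaling rule: N times the batch size -> roughly N times the learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * world_size)

# Linear warm-up: ramp the learning rate from ~0 up to the scaled value over warmup_steps steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
# Call scheduler.step() once per optimizer step during training.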

DP/DDP aggregate gradients across GPUs, but BatchNorm statistics are computed on each GPU from only its local slice of the data, which can be inaccurate; use sync BN (a conversion sketch follows).
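A minimal sketch of converting ordinary BatchNorm layers to SyncBatchNorm; the small Sequential model is an assumed example, and the conversion only takes effect once the model runs under DDP (see section 03).

import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())  # assumed example model
# Replace every BatchNorm*d layer with SyncBatchNorm; statistics are then
# computed over the whole process group instead of a single GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
# Wrap with DistributedDataParallel afterwards, as in the code in section 03.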

Ring all-reduce (map-reduce style): each GPU receives a chunk from the previous GPU in the ring and passes one on to the next.

  • There are two phases. After the first phase (reduce-scatter) each card holds the fully reduced result for one chunk; the second phase (all-gather) synchronizes those results to all cards.

  • Each step only transfers 1/N of the data and 2(N-1) steps are needed in total, so the communication cost is roughly independent of the number of GPUs. A toy simulation follows below.
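A toy, CPU-only simulation of the two phases, written only to illustrate the chunk passing and the 2(N-1) step count; real implementations (e.g. NCCL) do this in native code over actual GPU links.

N = 4
# data[r][c] = value of chunk c on "GPU" r; every chunk on rank r starts as r + 1.
data = [[float(r + 1)] * N for r in range(N)]

# Phase 1: reduce-scatter, N-1 steps. At step s, rank r sends its chunk (r - s) % N
# to rank (r + 1) % N, which adds it to its own copy of that chunk.
for s in range(N - 1):
    sends = [(r, (r - s) % N, data[r][(r - s) % N]) for r in range(N)]
    for r, c, val in sends:
        data[(r + 1) % N][c] += val
# Now rank r holds the fully reduced chunk (r + 1) % N, equal to 1 + 2 + ... + N.

# Phase 2: all-gather, another N-1 steps. The reduced chunks circulate unchanged
# until every rank has all of them.
for s in range(N - 1):
    sends = [(r, (r + 1 - s) % N, data[r][(r + 1 - s) % N]) for r in range(N)]
    for r, c, val in sends:
        data[(r + 1) % N][c] = val

print(data)  # every rank ends with [10.0, 10.0, 10.0, 10.0] after 2 * (N - 1) steps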

Model buffers are not parameters: they are updated not by back-propagation but by other mechanisms, e.g. BatchNorm's running variance and running mean (see the sketch below).
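A small sketch showing which BatchNorm tensors are parameters and which are buffers; the BatchNorm2d layer is just an assumed example.

import torch.nn as nn

bn = nn.BatchNorm2d(8)  # assumed example layer
print([n for n, _ in bn.named_parameters()])  # ['weight', 'bias']: updated by back-propagation
print([n for n, _ in bn.named_buffers()])     # ['running_mean', 'running_var', 'num_batches_tracked']: updated inside forward()
# By default DDP broadcasts buffers from rank 0 to all ranks at the start of each
# forward pass (broadcast_buffers=True), since they are not synchronized by gradients.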

03

How to write DDP code in practice

  • You can call dist.get_rank() to see the current rank; tasks that should not be repeated across processes, such as logging, should be done only on rank 0. A small sketch of this guard follows the list.

  • When not running distributed, default the rank to 0 so the same code path still works.

  • Debug on a single GPU first.

  • If you use wandb, you need to call wandb.finish() explicitly.
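A minimal sketch of the rank-0 guard and the explicit wandb.finish() mentioned above; the is_main_process helper and the project name are hypothetical, and wandb is assumed to be installed and configured.

import torch.distributed as dist
import wandb  # assumed to be installed

def is_main_process():
    # When torch.distributed is not initialized, treat the process as rank 0.
    return (not dist.is_initialized()) or dist.get_rank() == 0

if is_main_process():
    wandb.init(project="ddp-demo")  # hypothetical project name
# ... training loop, with wandb.log(...) calls guarded the same way ...
if is_main_process():
    wandb.finish()  # finish explicitly, as noted above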

Putting it together, a cleaned-up skeleton of the DDP training script (my_trainset, ToyModel and num_epochs stand in for the parts elided in the original snippet):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def demo_fn(rank, world_size):
    # One process per GPU; here the process rank doubles as the local CUDA device index.
    os.environ.setdefault("MASTER_ADDR", "localhost")  # env:// rendezvous for single-machine runs
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    local_rank = rank
    torch.cuda.set_device(local_rank)

    # Every process builds its own DistributedSampler so each sees a different shard of the data.
    train_sampler = DistributedSampler(my_trainset)
    # Note: batch_size here is the per-process batch size; the effective total
    # batch size is batch_size * world_size.
    trainloader = torch.utils.data.DataLoader(
        my_trainset, batch_size=16, num_workers=2, sampler=train_sampler)

    model = ToyModel().to(local_rank)
    # Load a checkpoint (if any) before constructing the DDP wrapper, and only on rank 0;
    # DDP broadcasts rank 0's weights to all other ranks at construction time.
    ckpt_path = None
    if dist.get_rank() == 0 and ckpt_path is not None:
        model.load_state_dict(torch.load(ckpt_path))
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

    loss_func = nn.CrossEntropyLoss().to(local_rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(num_epochs):
        # Reshuffle shards each epoch; without set_epoch every epoch uses the same split.
        trainloader.sampler.set_epoch(epoch)
        for data, label in trainloader:
            data, label = data.to(local_rank), label.to(local_rank)
            optimizer.zero_grad()
            loss = loss_func(model(data), label)
            loss.backward()
            optimizer.step()
        # Save only on rank 0, and save model.module (the unwrapped model).
        if dist.get_rank() == 0:
            torch.save(model.module.state_dict(), "%d.ckpt" % epoch)

    dist.destroy_process_group()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)
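A possible entry point for launching on a single machine; using torch.cuda.device_count() for world_size is an assumption, and the same script could instead be launched with torchrun, in which case process spawning and rank handling differ.

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    run_demo(demo_fn, world_size=n_gpus)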

Copyright notice
This article was created by [Love CV]. When reposting, please include a link to the original:
https://yzsam.com/2022/173/202206220422412499.html