
torch DDP Training

2022-06-22 04:26:00 Love CV

01

There are three main approaches to distributed training:

Option 1: Split the model itself across different GPUs (model parallelism). Only needed when the model is too large to fit on one GPU; otherwise rarely used.

Option 2: Keep one copy of the model and split the data across GPUs: torch.nn.DataParallel (DP). A minimal wrapping sketch follows this list.

  • Basically never produces bugs; close to drop-in.

  • Synchronized BatchNorm has to be handled by yourself.

Option 3: Each GPU holds its own copy of the model and its own slice of the data: torch.nn.parallel.DistributedDataParallel (DDP).

  • More error-prone: data is not shared between processes and file access order is not deterministic, so logging, dataset preprocessing, and placing the model and loss on the designated CUDA device all need careful design.

  • Sync BN is provided by PyTorch out of the box.

  • In principle the effect is the same as option 2: both train with a larger effective batch size. In practice it really is faster than option 2; the time spent moving data to CUDA seems to drop noticeably.

  • Supports multi-machine training.

  • With few GPUs and a network whose per-step runtime is short, it may actually be no better than option 2.
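As a rough illustration of option 2, here is a minimal DataParallel sketch; ToyModel is a hypothetical placeholder module, CUDA GPUs are assumed to be available, and the full DDP setup for option 3 appears in section 03.

import torch
import torch.nn as nn

class ToyModel(nn.Module):  # hypothetical placeholder model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)
    def forward(self, x):
        return self.fc(x)

# Option 2: single-process DataParallel. The input batch is scattered across
# all visible GPUs on every forward pass and the outputs are gathered back.
model = nn.DataParallel(ToyModel()).cuda()
out = model(torch.randn(32, 10).cuda())  # the batch of 32 is split across the GPUs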

02

Principle

Increasing the batch size brings the usual drawbacks of a larger batch size:

  • Worse generalization (over-fitting): alleviate it with learning-rate warm-up, and explore how far the batch size can be increased without hurting generalization.

  • The learning rate has to be scaled up accordingly: with n times the batch size, one epoch has n times fewer optimizer steps, so the learning rate is usually multiplied by n as well. A minimal sketch follows this list.
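A minimal sketch of the linear learning-rate scaling plus warm-up mentioned above; base_lr, world_size and warmup_steps are assumed example values, and LambdaLR is used only as one possible warm-up implementation.

import torch

base_lr, world_size, warmup_steps = 0.1, 4, 500   # assumed example values
model = torch.nn.Linear(10, 10)                   # placeholder model

# Linear scaling rule: N times the batch size -> roughly N times the learning rate.
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * world_size)

# Linear warm-up: ramp the learning rate from ~0 up to the scaled value over warmup_steps steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
# Call scheduler.step() once per optimizer step during training.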

DP/DDP aggregate gradients across GPUs, but BatchNorm statistics are computed on each GPU from only its local slice of the data, which can be inaccurate; use sync BN (a conversion sketch follows).
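A minimal sketch of converting ordinary BatchNorm layers to SyncBatchNorm; the small Sequential model is an assumed example, and the conversion only takes effect once the model runs under DDP (see section 03).

import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())  # assumed example model
# Replace every BatchNorm*d layer with SyncBatchNorm; statistics are then
# computed over the whole process group instead of a single GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
# Wrap with DistributedDataParallel afterwards, as in the code in section 03.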

Ring all-reduce (map-reduce style): each GPU receives a chunk from the previous GPU in the ring and passes one on to the next.

  • There are two phases. After the first phase (reduce-scatter) each card holds the fully reduced result for one chunk; the second phase (all-gather) synchronizes those results to all cards.

  • Each step only transfers 1/N of the data and 2(N-1) steps are needed in total, so the communication cost is roughly independent of the number of GPUs. A toy simulation follows below.
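A toy, CPU-only simulation of the two phases, written only to illustrate the chunk passing and the 2(N-1) step count; real implementations (e.g. NCCL) do this in native code over actual GPU links.

N = 4
# data[r][c] = value of chunk c on "GPU" r; every chunk on rank r starts as r + 1.
data = [[float(r + 1)] * N for r in range(N)]

# Phase 1: reduce-scatter, N-1 steps. At step s, rank r sends its chunk (r - s) % N
# to rank (r + 1) % N, which adds it to its own copy of that chunk.
for s in range(N - 1):
    sends = [(r, (r - s) % N, data[r][(r - s) % N]) for r in range(N)]
    for r, c, val in sends:
        data[(r + 1) % N][c] += val
# Now rank r holds the fully reduced chunk (r + 1) % N, equal to 1 + 2 + ... + N.

# Phase 2: all-gather, another N-1 steps. The reduced chunks circulate unchanged
# until every rank has all of them.
for s in range(N - 1):
    sends = [(r, (r + 1 - s) % N, data[r][(r + 1 - s) % N]) for r in range(N)]
    for r, c, val in sends:
        data[(r + 1) % N][c] = val

print(data)  # every rank ends with [10.0, 10.0, 10.0, 10.0] after 2 * (N - 1) steps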

Model buffers are not parameters: they are updated not by back-propagation but by other mechanisms, e.g. BatchNorm's running variance and running mean (see the sketch below).
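A small sketch showing which BatchNorm tensors are parameters and which are buffers; the BatchNorm2d layer is just an assumed example.

import torch.nn as nn

bn = nn.BatchNorm2d(8)  # assumed example layer
print([n for n, _ in bn.named_parameters()])  # ['weight', 'bias']: updated by back-propagation
print([n for n, _ in bn.named_buffers()])     # ['running_mean', 'running_var', 'num_batches_tracked']: updated inside forward()
# By default DDP broadcasts buffers from rank 0 to all ranks at the start of each
# forward pass (broadcast_buffers=True), since they are not synchronized by gradients.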

03

How to write DDP code in practice

  • You can call dist.get_rank() to see the current rank; tasks that should not be repeated across processes, such as logging, should be done only on rank 0. A small sketch of this guard follows the list.

  • When not running distributed, default the rank to 0 so the same code path still works.

  • Debug on a single GPU first.

  • If you use wandb, you need to call wandb.finish() explicitly.
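A minimal sketch of the rank-0 guard and the explicit wandb.finish() mentioned above; the is_main_process helper and the project name are hypothetical, and wandb is assumed to be installed and configured.

import torch.distributed as dist
import wandb  # assumed to be installed

def is_main_process():
    # When torch.distributed is not initialized, treat the process as rank 0.
    return (not dist.is_initialized()) or dist.get_rank() == 0

if is_main_process():
    wandb.init(project="ddp-demo")  # hypothetical project name
# ... training loop, with wandb.log(...) calls guarded the same way ...
if is_main_process():
    wandb.finish()  # finish explicitly, as noted above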

Putting it together, a cleaned-up skeleton of the DDP training script (my_trainset, ToyModel and num_epochs stand in for the parts elided in the original snippet):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def demo_fn(rank, world_size):
    # One process per GPU; here the process rank doubles as the local CUDA device index.
    os.environ.setdefault("MASTER_ADDR", "localhost")  # env:// rendezvous for single-machine runs
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    local_rank = rank
    torch.cuda.set_device(local_rank)

    # Every process builds its own DistributedSampler so each sees a different shard of the data.
    train_sampler = DistributedSampler(my_trainset)
    # Note: batch_size here is the per-process batch size; the effective total
    # batch size is batch_size * world_size.
    trainloader = torch.utils.data.DataLoader(
        my_trainset, batch_size=16, num_workers=2, sampler=train_sampler)

    model = ToyModel().to(local_rank)
    # Load a checkpoint (if any) before constructing the DDP wrapper, and only on rank 0;
    # DDP broadcasts rank 0's weights to all other ranks at construction time.
    ckpt_path = None
    if dist.get_rank() == 0 and ckpt_path is not None:
        model.load_state_dict(torch.load(ckpt_path))
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

    loss_func = nn.CrossEntropyLoss().to(local_rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(num_epochs):
        # Reshuffle shards each epoch; without set_epoch every epoch uses the same split.
        trainloader.sampler.set_epoch(epoch)
        for data, label in trainloader:
            data, label = data.to(local_rank), label.to(local_rank)
            optimizer.zero_grad()
            loss = loss_func(model(data), label)
            loss.backward()
            optimizer.step()
        # Save only on rank 0, and save model.module (the unwrapped model).
        if dist.get_rank() == 0:
            torch.save(model.module.state_dict(), "%d.ckpt" % epoch)

    dist.destroy_process_group()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)
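A possible entry point for launching on a single machine; using torch.cuda.device_count() for world_size is an assumption, and the same script could instead be launched with torchrun, in which case process spawning and rank handling differ.

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    run_demo(demo_fn, world_size=n_gpus)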

Copyright notice
This article was created by [Love CV]. When reposting, please include a link to the original:
https://yzsam.com/2022/173/202206220422412499.html