PyTorch deep learning: single-GPU and multi-GPU training
2022-07-28 06:06:00 【Alan and fish】
Single-machine single-GPU training mode
# GPU settings: whether to use the GPU, and which GPU to use
if config.use_gpu and torch.cuda.is_available():
    device = torch.device('cuda', config.gpu_id)
else:
    device = torch.device('cpu')
# Check whether a GPU is available
print('GPU available: ' + str(torch.cuda.is_available()))
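A minimal usage sketch, assuming placeholder names model, train_dataloader, and loss_fn that are not defined in the snippet above: single-GPU training then just means moving the model and every batch onto device.
model = model.to(device)  # move the model parameters onto the selected device
for data, label in train_dataloader:
    # move each batch onto the same device before the forward pass
    data, label = data.to(device), label.to(device)
    output = model(data)
    loss = loss_fn(output, label)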
Single-machine multi-GPU training mode
- Single Machine Data Parallel (single-machine multi-GPU mode): this approach is now deprecated

from torch.nn.parallel import DataParallel
device_id = [0, 1, 2, 3]
device = torch.device('cuda:{}'.format(device_id[0]))  # use GPU 0 as the main GPU
model = model.to(device)
model = DataParallel(model, device_ids=device_id, output_device=device)
DataParallel first scatters each input batch across the listed GPUs for the forward pass, then gathers the outputs back to the main GPU, where the loss is computed, as sketched below.
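A minimal usage sketch, assuming a model already wrapped as above and placeholder names train_dataloader, loss_fn, and optimizer (none of which are defined in the original snippet):
for data, label in train_dataloader:
    # inputs and labels go to the main GPU; DataParallel scatters the batch across device_id
    data, label = data.to(device), label.to(device)
    output = model(data)           # forward pass runs on every listed GPU
    loss = loss_fn(output, label)  # outputs are gathered on the main GPU before the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()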
- DistributedDataParallel (DDP for short, multi-process multi-GPU training)

Under DDP, each GPU runs its own process. The steps are:
- 1. Initialize the process group
torch.distributed.init_process_group(backend="nccl", world_size=n_gpus, rank=args.local_rank)
# backend: the communication backend (nccl for NVIDIA GPUs)
# world_size: the total number of processes (one per GPU)
# rank: the rank of the current process, i.e. which GPU it runs on
- 2. Bind the current process to its GPU (via the CUDA_VISIBLE_DEVICES environment variable and/or torch.cuda.set_device)
torch.cuda.set_device(args.local_rank)
- 3. Wrap the model
model = DistributedDataParallel(model.cuda(args.local_rank), device_ids=[args.local_rank])
- 4. Shard the data across the GPUs
train_sampler = DistributedSampler(train_dataset)
(the source code is in torch/utils/data/distributed.py)
- 5. Pass the sampler to the DataLoader; the data passed in no longer needs to be shuffled
- 6. Copy each batch to the GPU of the current process
data = data.cuda(args.local_rank)
- 7. Launch the training (when training with DDP, the script must be started via the launcher command)
python -m torch.distributed.launch --nproc_per_node=n_gpu train.py
- 8. Save the model
Call torch.save only on local_rank=0, and remember to save model.module.state_dict(); when loading with torch.load, pay attention to map_location (see the sketch after the notes below).
Notes:
- train.py must accept a local_rank argument; the launch utility passes it in
- the batch_size of each process should be the batch_size needed by a single GPU
- at the beginning of each epoch, call train_sampler.set_epoch(epoch) so that the data is fully reshuffled
- since a sampler is used, do not set shuffle=True in the DataLoader
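A minimal sketch of the save/load pattern from step 8, under the assumptions above (the checkpoint file name is a placeholder):
# Save only from the rank-0 process, and save the unwrapped weights via model.module
if args.local_rank == 0:
    torch.save(model.module.state_dict(), 'checkpoint.pth')
# When loading, use map_location so every process maps the weights onto its own GPU
state_dict = torch.load('checkpoint.pth', map_location=torch.device('cuda', args.local_rank))
model.module.load_state_dict(state_dict)
Saving from rank 0 only avoids several processes writing the same file at once, and map_location keeps every process from loading the checkpoint onto GPU 0 by default.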
Complete code
# System-related
import argparse
import os
# Framework-related
import torch
from torch.utils.data import DataLoader
import torch.optim as optim
import torch.nn as nn
# Custom package
from BruceNRE.config import config
from BruceNRE.utils import make_seed,load_pkl
from BruceNRE.process import process
from BruceNRE.dataset import CustomDataset,collate_fn
from BruceNRE import models
from BruceNRE.trainer import train,validate
# Import distributed training dependencies
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
__Models__={
"BruceCNN":models.BruceCNN
}
parser = argparse.ArgumentParser(description="Relationship extraction")
parser.add_argument("--model_name", type=str, default='BruceCNN', help='model name')
parser.add_argument('--local_rank', type=int, default=1, help='local device id on current node')
args = parser.parse_args()
if __name__ == "__main__":
    # ==================== Key code ==================================
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    # Initialize distributed training
    torch.distributed.init_process_group(backend="nccl")
    # Restrict the current process to its own GPU
    torch.cuda.set_device(args.local_rank)
    # Single machine, multiple GPUs: how many GPUs there are in total
    args.world_size = int(os.getenv("WORLD_SIZE", '1'))
    # Get the rank of the current process, used for inter-process communication
    args.global_rank = dist.get_rank()
    # =============================================================
    model_name = args.model_name if args.model_name else config.model_name
    # Set an initialization seed so that each training run is reproducible
    make_seed(config.seed)
    # Data preprocessing
    process(config.data_path, config.out_path, file_type='csv')
    # Load data
    vocab_path = os.path.join(config.out_path, 'vocab.pkl')
    train_data_path = os.path.join(config.out_path, 'train.pkl')
    test_data_path = os.path.join(config.out_path, 'test.pkl')
    vocab = load_pkl(vocab_path, 'vocab')
    vocab_size = len(vocab.word2idx)
    # CustomDataset subclasses torch.utils.data.Dataset and handles data loading; see Dataset for details
    train_dataset = CustomDataset(train_data_path, 'train-data')
    test_dataset = CustomDataset(test_data_path, 'test-data')
    # Test the CNN model
    model = __Models__[model_name](vocab_size, config)
    print(model)
    # ===================== Key code =================================
    # Pick the device and put the model on its GPU
    local_rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)
    global device
    device = torch.device("cuda", local_rank)
    # Move the model to the device, then wrap it with the DistributedDataParallel API
    model.to(device)
    # Wrap the model for multi-GPU training
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank,
                                                      find_unused_parameters=True)
    # Build the distributed samplers
    train_sample = DistributedSampler(train_dataset)
    test_sample = DistributedSampler(test_dataset)
    # With distributed training, shuffle must be set to False, because DistributedSampler already shuffles the data
    train_dataloader = DataLoader(
        dataset=train_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        drop_last=True,
        collate_fn=collate_fn,
        sampler=train_sample
    )
    test_dataloader = DataLoader(
        dataset=test_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        drop_last=True,
        collate_fn=collate_fn,
        sampler=test_sample
    )
    # =============================================
    # Build the optimizer
    optimizer = optim.Adam(model.parameters(), lr=config.learing_rate)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'max', factor=config.decay_rate, patience=config.decay_patience)
    # Loss function: cross entropy
    loss_fn = nn.CrossEntropyLoss()
    # Evaluation metrics: micro average and macro average
    best_macro_f1, best_macro_epoch = 0, 1
    best_micro_f1, best_micro_epoch = 0, 1
    best_macro_model, best_micro_model = '', ''
    print("*************************** Start training *******************************")
    for epoch in range(1, config.epoch + 1):
        train_sample.set_epoch(epoch)  # reshuffle so that each GPU gets different data every epoch
        train(epoch, device, train_dataloader, model, optimizer, loss_fn, config)
        macro_f1, micro_f1 = validate(test_dataloader, device, model, config)
        model_name = model.module.save(epoch=epoch)
        scheduler.step(macro_f1)
        if macro_f1 > best_macro_f1:
            best_macro_f1 = macro_f1
            best_macro_epoch = epoch
            best_macro_model = model_name
        if micro_f1 > best_micro_f1:
            best_micro_f1 = micro_f1
            best_micro_epoch = epoch
            best_micro_model = model_name
    print("========================= Model training complete ==================================")
    print(f'best macro f1:{best_macro_f1:.4f}', f'in epoch:{best_macro_epoch}, save in:{best_macro_model}')
    print(f'best micro f1:{best_micro_f1:.4f}', f'in epoch:{best_micro_epoch}, save in:{best_micro_model}')
Finally, run the script from the shell with the following command (for now this is the only approach I have found that works; other methods remain to be explored):
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
where:
- torch.distributed.launch starts the training in distributed mode
- nproc_per_node specifies the number of processes per node, which can be set to the number of GPUs
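On newer PyTorch releases (1.10 and later) torch.distributed.launch is deprecated in favor of torchrun; assuming such a version, an equivalent launch would be:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py
In that case each process reads its rank from the LOCAL_RANK environment variable (for example int(os.environ['LOCAL_RANK'])) rather than from a --local_rank command-line argument.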