PyTorch deep learning: single-card and multi-card training
2022-07-28 06:06:00 【Alan and fish】
Single-machine, single-card training
# GPU settings: whether to use the GPU and which card to use
if config.use_gpu and torch.cuda.is_available():
    device = torch.device('cuda', config.gpu_id)
else:
    device = torch.device('cpu')
# Check whether a GPU is available
print('GPU available: ' + str(torch.cuda.is_available()))
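Whichever branch is taken, the model and every batch have to be moved onto device before the forward pass. Below is a minimal, self-contained sketch; the toy nn.Linear model and random tensors are only illustrative and not part of the project above:
import torch
import torch.nn as nn

device = torch.device('cuda', 0) if torch.cuda.is_available() else torch.device('cpu')

model = nn.Linear(10, 2).to(device)            # move the model's parameters onto the device
inputs = torch.randn(4, 10).to(device)         # move the batch onto the same device
labels = torch.randint(0, 2, (4,)).to(device)

outputs = model(inputs)                        # the forward pass now runs on the selected device
loss = nn.CrossEntropyLoss()(outputs, labels)
loss.backward()
print(loss.item())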
Single-machine, multi-card training
- Single-machine Data Parallel (single-machine, multi-card mode): this approach has largely been phased out in favour of DDP

from torch.nn.parallel import DataParallel
device_id = [0, 1, 2, 3]
device = torch.device('cuda:{}'.format(device_id[0]))  # GPU 0 in the list acts as the main GPU
model = model.to(device)
model = DataParallel(model, device_ids=device_id, output_device=device)
During the forward pass the input batch is first scattered across all GPUs in the list; the outputs are then gathered back onto the main GPU, where the loss is computed.
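To make this scatter/gather behaviour concrete, here is a minimal sketch assuming four visible GPUs; the toy nn.Linear model and random tensors are illustrative only:
import torch
import torch.nn as nn
from torch.nn.parallel import DataParallel

device_id = [0, 1, 2, 3]
device = torch.device('cuda:{}'.format(device_id[0]))
model = DataParallel(nn.Linear(10, 2).to(device), device_ids=device_id, output_device=device)

inputs = torch.randn(8, 10).to(device)          # the batch only has to sit on the main GPU
labels = torch.randint(0, 2, (8,)).to(device)

outputs = model(inputs)                          # scattered over the four GPUs, gathered back on GPU 0
loss = nn.CrossEntropyLoss()(outputs, labels)    # so the loss is computed on the main GPU
loss.backward()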
- DistributedDataParallel (DDP for short, multi-process multi-card training)

The code is adapted in the following steps:
- 1. Initialize the process group
torch.distributed.init_process_group(backend="nccl", world_size=n_gpus, rank=args.local_rank)
# backend: communication backend (nccl for NVIDIA GPUs)
# world_size: total number of processes, i.e. how many GPUs take part
# rank: the rank of the current process, i.e. which GPU it runs on
- 2. Set the CUDA device for the current process (the visible cards can additionally be restricted via CUDA_VISIBLE_DEVICES)
torch.cuda.set_device(args.local_rank)
- 3. Wrap the model
model = DistributedDataParallel(model.cuda(args.local_rank), device_ids=[args.local_rank])
- 4. Give each card (process) its own shard of the data
train_sampler = DistributedSampler(train_dataset)
(the source lives in torch/utils/data/distributed.py)
- 5. Pass the sampler to the DataLoader; the data it yields no longer needs to be shuffled by the DataLoader
- 6. Copy each batch to the GPU of the current process
data = data.cuda(args.local_rank)
- 7. Launch with the command line (DDP training has to be started through the launcher)
python -m torch.distributed.launch --nproc_per_node=n_gpu train.py
- 8. Save and load the model
Call torch.save only on local_rank=0, and remember to save model.module.state_dict(); a sketch follows below.
When calling torch.load, pay attention to map_location.
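A minimal sketch of point 8, assuming the DDP-wrapped model and args.local_rank from the steps above; the checkpoint path 'checkpoint.pt' is only illustrative:
import torch
import torch.distributed as dist

# Save only from rank 0, and unwrap the DDP container through model.module
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), 'checkpoint.pt')
dist.barrier()  # make sure the file exists before other ranks try to read it

# When loading, map the weights onto the GPU of the current process
state_dict = torch.load('checkpoint.pt',
                        map_location=torch.device('cuda', args.local_rank))
model.module.load_state_dict(state_dict)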
Notes:
- train.py must accept a local_rank argument; launch passes this parameter in
- The batch_size of each process is the per-GPU batch_size
- At the start of each epoch, call train_sampler.set_epoch(epoch) so the data is properly reshuffled across epochs
- Since a sampler is used, do not set shuffle=True in the DataLoader
Complete code
# System related
import argparse
import os
# Framework-related
import torch
from torch.utils.data import DataLoader
import torch.optim as optim
import torch.nn as nn
# Custom package
from BruceNRE.config import config
from BruceNRE.utils import make_seed,load_pkl
from BruceNRE.process import process
from BruceNRE.dataset import CustomDataset,collate_fn
from BruceNRE import models
from BruceNRE.trainer import train,validate
# Import distributed training dependencies
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
__Models__={
"BruceCNN":models.BruceCNN
}
parser = argparse.ArgumentParser(description="Relation extraction")
parser.add_argument("--model_name", type=str, default='BruceCNN', help='model name')
parser.add_argument('--local_rank', type=int, default=1, help='local device id on the current node')
args = parser.parse_args()
if __name__ == "__main__":
    # ==================== Key code ==================================
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    # Initialize distributed training
    torch.distributed.init_process_group(backend="nccl")
    # Restrict the current process to its own card
    torch.cuda.set_device(args.local_rank)
    # Single machine, multiple cards: total number of GPUs (world size)
    args.world_size = int(os.getenv("WORLD_SIZE", '1'))
    # Get the rank of the current process, used for inter-process communication
    args.global_rank = dist.get_rank()
    # =============================================================
    model_name = args.model_name if args.model_name else config.model_name
    # Set an initialization seed so that every training run is reproducible
    make_seed(config.seed)
    # Data preprocessing
    process(config.data_path, config.out_path, file_type='csv')
    # Load data
    vocab_path = os.path.join(config.out_path, 'vocab.pkl')
    train_data_path = os.path.join(config.out_path, 'train.pkl')
    test_data_path = os.path.join(config.out_path, 'test.pkl')
    vocab = load_pkl(vocab_path, 'vocab')
    vocab_size = len(vocab.word2idx)
    # CustomDataset subclasses torch.utils.data.Dataset and handles data loading; see Dataset for details
    train_dataset = CustomDataset(train_data_path, 'train-data')
    test_dataset = CustomDataset(test_data_path, 'test-data')
    # Test the CNN model
    model = __Models__[model_name](vocab_size, config)
    print(model)
    # ===================== Key code =================================
    # Define the device and put the model on the GPU
    local_rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)
    global device
    device = torch.device("cuda", local_rank)
    # Copy the model to the device, then hand it to the DistributedDataParallel API
    model.to(device)
    # Wrap the model for multi-GPU training
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank,
                                                      find_unused_parameters=True)
    # Construct the distributed samplers
    train_sample = DistributedSampler(train_dataset)
    test_sample = DistributedSampler(test_dataset)
    # With distributed training, shuffle must be False because the DistributedSampler already shuffles the data
    train_dataloader = DataLoader(
        dataset=train_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        drop_last=True,
        collate_fn=collate_fn,
        sampler=train_sample
    )
    test_dataloader = DataLoader(
        dataset=test_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        drop_last=True,
        collate_fn=collate_fn,
        sampler=test_sample
    )
    # =============================================
    # Build the optimizer
    optimizer = optim.Adam(model.parameters(), lr=config.learing_rate)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'max', factor=config.decay_rate, patience=config.decay_patience)
    # Loss function: cross entropy
    loss_fn = nn.CrossEntropyLoss()
    # Evaluation metrics: micro-averaged and macro-averaged F1
    best_macro_f1, best_macro_epoch = 0, 1
    best_micro_f1, best_micro_epoch = 0, 1
    best_macro_model, best_micro_model = '', ''
    print("*************************** Start training *******************************")
    for epoch in range(1, config.epoch + 1):
        train_sample.set_epoch(epoch)  # reshuffle the data for each card every epoch
        train(epoch, device, train_dataloader, model, optimizer, loss_fn, config)
        macro_f1, micro_f1 = validate(test_dataloader, device, model, config)
        model_name = model.module.save(epoch=epoch)
        scheduler.step(macro_f1)
        if macro_f1 > best_macro_f1:
            best_macro_f1 = macro_f1
            best_macro_epoch = epoch
            best_macro_model = model_name
        if micro_f1 > best_micro_f1:
            best_micro_f1 = micro_f1
            best_micro_epoch = epoch
            best_micro_model = model_name
    print("========================= Training complete ==================================")
    print(f'best macro f1: {best_macro_f1:.4f}', f'in epoch: {best_macro_epoch}, saved in: {best_macro_model}')
    print(f'best micro f1: {best_micro_f1:.4f}', f'in epoch: {best_micro_epoch}, saved in: {best_micro_model}')
Finally, run the script from the shell with the following command (for now this is the only way I have found that works; other launch methods remain to be explored):
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
where
- torch.distributed.launch starts the training in distributed mode,
- nproc_per_node specifies the number of processes per node; set it to the number of GPUs on the machine