PyTorch deep learning: single-GPU and multi-GPU training
2022-07-28 06:06:00 【Alan and fish】
Single-machine single-GPU training mode
# GPU settings: whether to use the GPU, and which GPU to use
if config.use_gpu and torch.cuda.is_available():
    device = torch.device('cuda', config.gpu_id)
else:
    device = torch.device('cpu')
# Check whether a GPU is available
print('GPU available: ' + str(torch.cuda.is_available()))
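A minimal usage sketch, assuming placeholder names model, train_dataloader, and loss_fn that are not defined in the snippet above: single-GPU training then just means moving the model and every batch onto device.
model = model.to(device)  # move the model parameters onto the selected device
for data, label in train_dataloader:
    # move each batch onto the same device before the forward pass
    data, label = data.to(device), label.to(device)
    output = model(data)
    loss = loss_fn(output, label)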
Single-machine multi-GPU training mode
- Single Machine Data Parallel (single-machine multi-GPU mode): this approach is now deprecated

from torch.nn.parallel import DataParallel
device_id = [0, 1, 2, 3]
device = torch.device('cuda:{}'.format(device_id[0]))  # use GPU 0 as the main GPU
model = model.to(device)
model = DataParallel(model, device_ids=device_id, output_device=device)
DataParallel first scatters each input batch across the listed GPUs for the forward pass, then gathers the outputs back to the main GPU, where the loss is computed, as sketched below.
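A minimal usage sketch, assuming a model already wrapped as above and placeholder names train_dataloader, loss_fn, and optimizer (none of which are defined in the original snippet):
for data, label in train_dataloader:
    # inputs and labels go to the main GPU; DataParallel scatters the batch across device_id
    data, label = data.to(device), label.to(device)
    output = model(data)           # forward pass runs on every listed GPU
    loss = loss_fn(output, label)  # outputs are gathered on the main GPU before the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()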
- DistributedDataParallel (DDP for short, multi-process multi-GPU training)

Under DDP, each GPU runs its own process. The steps are:
- 1. Initialize the process group
torch.distributed.init_process_group(backend="nccl", world_size=n_gpus, rank=args.local_rank)
# backend: the communication backend (nccl for NVIDIA GPUs)
# world_size: the total number of processes (one per GPU)
# rank: the rank of the current process, i.e. which GPU it runs on
- 2. Bind the current process to its GPU (via the CUDA_VISIBLE_DEVICES environment variable and/or torch.cuda.set_device)
torch.cuda.set_device(args.local_rank)
- 3. Wrap the model
model = DistributedDataParallel(model.cuda(args.local_rank), device_ids=[args.local_rank])
- 4. Shard the data across the GPUs
train_sampler = DistributedSampler(train_dataset)
(the source code is in torch/utils/data/distributed.py)
- 5. Pass the sampler to the DataLoader; the data passed in no longer needs to be shuffled
- 6. Copy each batch to the GPU of the current process
data = data.cuda(args.local_rank)
- 7. Launch the training (when training with DDP, the script must be started via the launcher command)
python -m torch.distributed.launch --nproc_per_node=n_gpu train.py
- 8. Save the model
Call torch.save only on local_rank=0, and remember to save model.module.state_dict(); when loading with torch.load, pay attention to map_location (see the sketch after the notes below).
Notes:
- train.py must accept a local_rank argument; the launch utility passes it in
- the batch_size of each process should be the batch_size needed by a single GPU
- at the beginning of each epoch, call train_sampler.set_epoch(epoch) so that the data is fully reshuffled
- since a sampler is used, do not set shuffle=True in the DataLoader
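A minimal sketch of the save/load pattern from step 8, under the assumptions above (the checkpoint file name is a placeholder):
# Save only from the rank-0 process, and save the unwrapped weights via model.module
if args.local_rank == 0:
    torch.save(model.module.state_dict(), 'checkpoint.pth')
# When loading, use map_location so every process maps the weights onto its own GPU
state_dict = torch.load('checkpoint.pth', map_location=torch.device('cuda', args.local_rank))
model.module.load_state_dict(state_dict)
Saving from rank 0 only avoids several processes writing the same file at once, and map_location keeps every process from loading the checkpoint onto GPU 0 by default.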
Complete code
# System-related
import argparse
import os
# Framework-related
import torch
from torch.utils.data import DataLoader
import torch.optim as optim
import torch.nn as nn
# Custom package
from BruceNRE.config import config
from BruceNRE.utils import make_seed,load_pkl
from BruceNRE.process import process
from BruceNRE.dataset import CustomDataset,collate_fn
from BruceNRE import models
from BruceNRE.trainer import train,validate
# Import distributed training dependencies
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
__Models__={
"BruceCNN":models.BruceCNN
}
parser = argparse.ArgumentParser(description="Relationship extraction")
parser.add_argument("--model_name", type=str, default='BruceCNN', help='model name')
parser.add_argument('--local_rank', type=int, default=1, help='local device id on current node')
args = parser.parse_args()
if __name__ == "__main__":
    # ==================== Key code ==================================
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    # Initialize distributed training
    torch.distributed.init_process_group(backend="nccl")
    # Restrict the current process to its own GPU
    torch.cuda.set_device(args.local_rank)
    # Single machine, multiple GPUs: how many GPUs there are in total
    args.world_size = int(os.getenv("WORLD_SIZE", '1'))
    # Get the rank of the current process, used for inter-process communication
    args.global_rank = dist.get_rank()
    # =============================================================
    model_name = args.model_name if args.model_name else config.model_name
    # Set an initialization seed so that each training run is reproducible
    make_seed(config.seed)
    # Data preprocessing
    process(config.data_path, config.out_path, file_type='csv')
    # Load data
    vocab_path = os.path.join(config.out_path, 'vocab.pkl')
    train_data_path = os.path.join(config.out_path, 'train.pkl')
    test_data_path = os.path.join(config.out_path, 'test.pkl')
    vocab = load_pkl(vocab_path, 'vocab')
    vocab_size = len(vocab.word2idx)
    # CustomDataset subclasses torch.utils.data.Dataset and handles data loading; see Dataset for details
    train_dataset = CustomDataset(train_data_path, 'train-data')
    test_dataset = CustomDataset(test_data_path, 'test-data')
    # Test the CNN model
    model = __Models__[model_name](vocab_size, config)
    print(model)
    # ===================== Key code =================================
    # Pick the device and put the model on its GPU
    local_rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)
    global device
    device = torch.device("cuda", local_rank)
    # Move the model to the device, then wrap it with the DistributedDataParallel API
    model.to(device)
    # Wrap the model for multi-GPU training
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank,
                                                      find_unused_parameters=True)
    # Build the distributed samplers
    train_sample = DistributedSampler(train_dataset)
    test_sample = DistributedSampler(test_dataset)
    # With distributed training, shuffle must be set to False, because DistributedSampler already shuffles the data
    train_dataloader = DataLoader(
        dataset=train_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        drop_last=True,
        collate_fn=collate_fn,
        sampler=train_sample
    )
    test_dataloader = DataLoader(
        dataset=test_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        drop_last=True,
        collate_fn=collate_fn,
        sampler=test_sample
    )
    # =============================================
    # Build the optimizer
    optimizer = optim.Adam(model.parameters(), lr=config.learing_rate)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'max', factor=config.decay_rate, patience=config.decay_patience)
    # Loss function: cross entropy
    loss_fn = nn.CrossEntropyLoss()
    # Evaluation metrics: micro average and macro average
    best_macro_f1, best_macro_epoch = 0, 1
    best_micro_f1, best_micro_epoch = 0, 1
    best_macro_model, best_micro_model = '', ''
    print("*************************** Start training *******************************")
    for epoch in range(1, config.epoch + 1):
        train_sample.set_epoch(epoch)  # reshuffle so that each GPU gets different data every epoch
        train(epoch, device, train_dataloader, model, optimizer, loss_fn, config)
        macro_f1, micro_f1 = validate(test_dataloader, device, model, config)
        model_name = model.module.save(epoch=epoch)
        scheduler.step(macro_f1)
        if macro_f1 > best_macro_f1:
            best_macro_f1 = macro_f1
            best_macro_epoch = epoch
            best_macro_model = model_name
        if micro_f1 > best_micro_f1:
            best_micro_f1 = micro_f1
            best_micro_epoch = epoch
            best_micro_model = model_name
    print("========================= Model training complete ==================================")
    print(f'best macro f1:{best_macro_f1:.4f}', f'in epoch:{best_macro_epoch}, save in:{best_macro_model}')
    print(f'best micro f1:{best_micro_f1:.4f}', f'in epoch:{best_micro_epoch}, save in:{best_micro_model}')
Finally, run the script from the shell with the following command (for now this is the only approach I have found that works; other methods remain to be explored):
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 main.py
where:
- torch.distributed.launch starts the training in distributed mode
- nproc_per_node specifies the number of processes per node, which can be set to the number of GPUs
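On newer PyTorch releases (1.10 and later) torch.distributed.launch is deprecated in favor of torchrun; assuming such a version, an equivalent launch would be:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py
In that case each process reads its rank from the LOCAL_RANK environment variable (for example int(os.environ['LOCAL_RANK'])) rather than from a --local_rank command-line argument.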