[TensorRT] Converting PyTorch Models into Deployable TensorRT Engines
2022-07-29 06:03:00 【Dull cat】
When bringing a deep learning model into production, you face the problem of deploying it to edge devices. Models are trained with different frameworks, and normally the same framework is needed again at inference time; but target platforms vary widely, and tuning and implementing the model on each one is difficult because every platform has its own capabilities and characteristics. Supporting multiple frameworks on one platform only adds complexity. This is where ONNX is useful: models trained in different frameworks can be converted to a common ONNX model, which can then be converted to the formats supported by the various platforms, greatly simplifying deployment.
1. What is ONNX
ONNX is short for Open Neural Network Exchange. It is an open standard for representing deep learning models that allows a model to be moved directly between frameworks.
ONNX is a first step toward an open ecosystem: it provides an open-source format for models, so developers are not locked into one particular development tool.
Frameworks and runtimes that currently support ONNX include Caffe2, PyTorch, TensorFlow, MXNet, TensorRT, and CNTK.
In practice, ONNX is an intermediary, a means to an end: you first convert the trained model to ONNX, then convert the ONNX model into a deployable form such as TensorRT.
Typical conversion routes:
- PyTorch → ONNX → TensorRT
- PyTorch → ONNX → TVM
- TF → ONNX → NCNN
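
Once you have an ONNX file, it can be sanity-checked with the onnx package. A minimal sketch (the file name tmp.onnx matches the export example in the next section):

import onnx

onnx_model = onnx.load("tmp.onnx")                    # load the exported model
onnx.checker.check_model(onnx_model)                  # raises if the graph is malformed
print(onnx.helper.printable_graph(onnx_model.graph))  # human-readable graph dump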
2. PyTorch to ONNX
import torch
import torchvision

# The model and input below are stand-ins: replace them with your own
# trained model and a sample input of the correct shape.
model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    (dummy_input, ),
    'tmp.onnx',
    input_names=['input.1'],
    output_names=['output'],
    export_params=True,                      # store the trained weights in the file
    keep_initializers_as_inputs=False,
    verbose=False,
    opset_version=11,
    dynamic_axes={'input.1': {0: 'batch'}},  # allow a variable batch dimension
)
# Simplify the ONNX model with onnx-simplifier:
python3 -m onnxsim tmp.onnx tmp_simplify.onnx
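
After exporting (and optionally simplifying), it is worth checking that the ONNX model agrees with the original PyTorch model. A minimal verification with onnxruntime, assuming the export code above has been run:

import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession('tmp.onnx', providers=['CPUExecutionProvider'])
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
onnx_out = sess.run(None, {'input.1': dummy})[0]

with torch.no_grad():
    torch_out = model(torch.from_numpy(dummy)).numpy()

# The outputs should agree up to small floating-point differences.
print(np.abs(onnx_out - torch_out).max())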
3. What is TensorRT
TensorRT is a high-performance deep learning inference optimizer: a C++ inference framework that runs on NVIDIA GPU hardware. It delivers low-latency, high-throughput inference for deployed deep learning models and can accelerate inference on embedded platforms and autonomous-driving platforms.
Combining TensorRT with NVIDIA GPUs allows fast, efficient deployment of models trained in almost any framework, and the resulting speedup on NVIDIA GPUs is considerable.
A model's life cycle has two stages, training and inference. Training involves both forward and backward propagation, while inference runs only the forward pass, so at deployment time prediction speed is what matters.
Training usually runs on multiple GPUs in a distributed setup, whereas inference is often deployed on a single GPU or even an embedded platform. Differences between the training framework and the target machine's performance can slow inference down and make real-time requirements hard to meet. TensorRT is the inference optimizer that addresses this: once the ONNX model has been converted to a TensorRT engine, it can be deployed on the target device.
TensorRT optimization methods:
TensorRT applies several kinds of optimization; the first two below are the most important:
Layer fusion and tensor fusion:
TensorRT merges layers vertically and horizontally, greatly reducing the number of layers. Vertical fusion merges sequential convolution, bias, and ReLU layers into a single CBR structure (so named because the convolution, bias, and ReLU layers are fused to form a single layer), which occupies only one CUDA kernel. Horizontal fusion merges layers that have the same structure but different weights into a single, wider layer, again occupying only one CUDA kernel. The fused computation graph has fewer layers and uses fewer CUDA kernels, so the whole model is smaller, faster, and more efficient.
Precision calibration:
Most deep learning frameworks train neural networks with tensors in 32-bit floating-point precision (full 32-bit precision, FP32). Once training is finished, deployment inference needs no back propagation, so the data precision can be reduced appropriately, for example to FP16 or INT8. Lower precision means lower memory usage and latency, and a smaller model.
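For INT8, TensorRT requires a calibrator that feeds it representative input batches. A minimal sketch of an entropy calibrator using the TensorRT 7-style Python API (pycuda handles device memory; the calibration data is assumed to be a list of preprocessed np.float32 arrays):

import os
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calib.cache"):
        super().__init__()
        self.batches = iter(batches)
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            batch = next(self.batches)
            cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
            return [int(self.device_input)]
        except StopIteration:
            return None  # no more batches: calibration is finished

    def read_calibration_cache(self):
        # Reuse the result of a previous calibration run if it exists.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

An instance of this class can be passed as the int8_calibrator argument of the conversion function shown in section 4.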
Kernel Auto-Tuning:
When the network model runs inference, the computation is done by calling the GPU's CUDA kernels. TensorRT can tune the CUDA kernel selection for different algorithms, different model structures, and different GPU platforms, ensuring that the model computes with optimal performance on the specific target platform.
For example, if you deploy on both an RTX 3090 and a T4, the TensorRT conversion must be run on each platform separately and the resulting engine used on that same platform; an engine converted on one platform cannot be used on another.
Dynamic Tensor Memory:
For each tensor, TensorRT allocates memory only for the duration of its use, which avoids repeated allocation of device memory, reduces memory usage, and improves reuse efficiency.
4. ONNX to TensorRT
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def convert_tensorrt_engine(onnx_fn, trt_fn, max_batch_size, fp16=True,
                            int8_calibrator=None, workspace=2_000_000_000):
    # Note: this uses the TensorRT 7.x builder API; in TensorRT 8+ the
    # max_workspace_size / fp16_mode / int8_mode flags moved to IBuilderConfig.
    network_creation_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(network_creation_flag) as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = workspace
        builder.max_batch_size = max_batch_size
        builder.fp16_mode = fp16
        if int8_calibrator:
            builder.int8_mode = True
            builder.int8_calibrator = int8_calibrator
        # Parse the ONNX file into the TensorRT network definition.
        with open(onnx_fn, "rb") as f:
            if not parser.parse(f.read()):
                print("got {} errors:".format(parser.num_errors))
                for i in range(parser.num_errors):
                    e = parser.get_error(i)
                    print(e.code(), e.desc(), e.node())
                return
            else:
                print("parse successful")
        print("inputs:", network.num_inputs)
        # inputs = [network.get_input(i) for i in range(network.num_inputs)]
        # opt_profiles = create_optimization_profiles(builder, inputs)
        # add_profiles(config, inputs, opt_profiles)
        for i in range(network.num_inputs):
            print(i, network.get_input(i).name, network.get_input(i).shape)
        print("outputs:", network.num_outputs)
        for i in range(network.num_outputs):
            output = network.get_output(i)
            print(i, output.name, output.shape)
        # Build and serialize the engine.
        engine = builder.build_cuda_engine(network)
        with open(trt_fn, "wb") as f:
            f.write(engine.serialize())
        print("done")