PyTorch custom CUDA operator tutorial and runtime analysis
Recently, because of work needs, I've been learning a lot of CUDA. Here is a brief record of how to write custom CUDA operators for PyTorch, illustrated with a very simple example, along with an introduction to the correct way to analyze CUDA running time in PyTorch.
All the code is on GitHub: https://github.com/godweiyang/torch-cuda-example
Complete workflow
Let's take a closer look at how PyTorch calls a custom CUDA operator.
First, there are four code files:
- main.py: the Python entry point, where you would normally write your model code.
- add2.cpp: the bridge between torch and CUDA, which wraps the CUDA program into a library that Python can call.
- add2.h: the CUDA function declaration.
- add2.cu: the CUDA function implementation.
Now let's go through them file by file to see how the call works.
CUDA Operator implementation
The simplest parts are add2.h and add2.cu, which are an ordinary CUDA implementation.
The function implemented here adds two tensors of length n element-wise, with 1024 threads per block and n/1024 blocks in total. We won't go into the CUDA details; they are not the focus of this article.
add2_kernel is the kernel function and runs on the GPU. launch_add2 is the function executed on the CPU side; it launches the kernel. Note that the launch is asynchronous: after the call, control returns immediately to the CPU. Be careful about this when measuring time later, or you will easily end up counting only the launch time.
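For reference, here is a minimal sketch of what add2.h and add2.cu look like (the function names follow the repo above, but treat this as an illustration rather than the exact code):

```cuda
// add2.h -- declaration shared by the .cu and .cpp files
void launch_add2(float *c, const float *a, const float *b, int n);

// add2.cu -- the kernel and its CPU-side launcher
__global__ void add2_kernel(float *c, const float *a, const float *b, int n) {
    // Grid-stride loop: each thread handles indices i, i + stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

void launch_add2(float *c, const float *a, const float *b, int n) {
    dim3 block(1024);              // 1024 threads per block
    dim3 grid((n + 1023) / 1024);  // n / 1024 blocks, rounded up
    // Asynchronous launch: control returns to the CPU immediately.
    add2_kernel<<<grid, block>>>(c, a, b, n);
}
```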
Torch C++ wrapper
This step involves add2.cpp, whose main job is to provide an interface that PyTorch can call.
The torch_launch_add2 function takes the C++ versions of the torch tensors, converts them to raw pointers, and calls the CUDA function launch_add2 to run the kernel.
Here pybind11 is used to wrap the torch_launch_add2 function, which could then be compiled with cmake into a .so library callable from Python. But we won't compile it manually with cmake; the actual method is described in the following sections.
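A minimal sketch of add2.cpp, assuming the launch_add2 signature from the sketch above (the repo's actual code may differ in details):

```cpp
#include <torch/extension.h>
#include "add2.h"

// Unpack the torch tensors into raw float pointers and launch the kernel.
void torch_launch_add2(torch::Tensor &c,
                       const torch::Tensor &a,
                       const torch::Tensor &b,
                       int64_t n) {
    launch_add2((float *)c.data_ptr(),
                (const float *)a.data_ptr(),
                (const float *)b.data_ptr(),
                n);
}

// Expose the function to Python; TORCH_EXTENSION_NAME is filled in by
// PyTorch's extension build system.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("torch_launch_add2",
          &torch_launch_add2,
          "add2 kernel wrapper");
}
```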
Python call
Finally, the Python level: this is where we, the users, write code to call the library generated above.
Here the torch.utils.cpp_extension.load function is used to automatically compile the cpp and cu files above. The key argument is sources, which specifies the list of files to compile. After that, the wrapped interface can be called through cuda_module.torch_launch_add2.
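A minimal sketch of this Python side (file names follow the repo layout above; the tensor size is just for illustration):

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile add2.cpp and add2.cu into an importable extension module.
cuda_module = load(name="add2",
                   sources=["add2.cpp", "add2.cu"],
                   verbose=True)

n = 1024 * 1024
a = torch.rand(n, device="cuda:0")
b = torch.rand(n, device="cuda:0")
c = torch.empty(n, device="cuda:0")

# Call the wrapped CUDA kernel through the pybind11 interface.
cuda_module.torch_launch_add2(c, a, b, n)
```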
The rest of the code can do whatever you like. Here it simply measures the running time and compares it with the speed of equivalent torch code; that part is saved for the next section.
To sum up, there are three main pieces:
- First write the CUDA operator and its corresponding launch function.
- Then write a torch cpp function that connects PyTorch and CUDA, wrapped with pybind11.
- Finally compile and call it with PyTorch's cpp extension library.
Runtime analysis
We know that CUDA kernel launches are asynchronous, so we can't simply wrap the CUDA call with time.time() to measure time; that would only measure the time to issue the CUDA API call, not the time the GPU spends running.
So we need to add a thread synchronization function and wait until all threads of the kernel have finished before the CPU executes the subsequent instructions. Here the synchronization is added on the Python side, using the torch.cuda.synchronize function.
Specifically, the code is shaped like the following:
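A sketch of that timing pattern (the helper name show_time is mine, not necessarily the repo's; the run counts follow the description below):

```python
import time
import torch

def show_time(fn, ntest=10):
    # Warm up first so the GPU reaches a steady state before timing.
    for _ in range(10):
        fn()
    times = []
    for _ in range(ntest):
        # First sync: make sure no earlier GPU work is still in flight.
        torch.cuda.synchronize(device="cuda:0")
        start = time.time()
        fn()
        # Second sync: wait until fn's kernel has actually finished.
        torch.cuda.synchronize(device="cuda:0")
        times.append((time.time() - start) * 1e6)  # microseconds
    return sum(times) / ntest
```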
The first synchronization prevents any still-unsynchronized GPU instructions from earlier code from leaking into the measurement; the second synchronization waits until all of fn()'s threads have finished before taking the end time.
Here torch and cuda are each timed 10 times and averaged, and before timing we run 10 warm-up iterations to let the GPU reach a steady state.
We test four cases:
- Synchronize both times
- Synchronize only the first time
- Synchronize only the second time
- No synchronization at all
Here NVIDIA Nsight Systems is used to visualize which instructions execute at each moment of the run.
The installation command is:
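For example, on Ubuntu one option (an assumption here; check NVIDIA's documentation for your platform) is to install the package downloaded from NVIDIA's developer site:

```bash
# Download the installer from https://developer.nvidia.com/nsight-systems,
# then install it; <version> is a placeholder for the file you downloaded.
sudo apt install ./nsight-systems-<version>.deb
```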
Then run the Python code as usual, just with nsys profile prepended to the command:
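For example, assuming the entry script is main.py:

```bash
nsys profile python3 main.py
```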
This generates two files, report1.qdstrm and report1.sqlite. Convert report1.qdstrm into a report1.qdrep file:
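This can be done with the QdstrmImporter tool that ships with the Nsight Systems host installation (assuming it is on your PATH):

```bash
QdstrmImporter -i report1.qdstrm
```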
Finally, open the generated report1.qdrep file in the Nsight Systems GUI software; I'm on macOS here.
Synchronize both times
This is the correct way to measure time. Open Nsight Systems and zoom into the section where the kernels run; you'll see the figure below:
[Figure: Nsight Systems timeline of the whole run]
The 1st and 3rd boxes are the GPU warm-up phases of cuda and torch respectively; there is no thread synchronization in this part (the yellow blocks above).
The 2nd and 4th boxes are the addition runs of cuda and torch; let's zoom in for a closer look.
[Figure: zoomed-in view of the timed runs]
As you can see, each run (one box) goes through three steps: first the API call (the blue box in the upper left), then the kernel execution (the blue box below), and finally the thread synchronization (the yellow box in the upper right).
So the time we finally measure is the total of these three steps, i.e. the selected range in the figure below:
[Figure: the selected range spanning API call, kernel execution, and synchronization]
It is about 29us, which is fairly close to what the actual code measures:
[Figure: timing output of the script]
In fact, the time we really want does not include the API call and the thread synchronization, but that part is hard to strip out on the Python side, so it is included.
Synchronize only the first time
Zoom in on one execution:
[Figure: zoomed-in view with the second synchronization removed]
As you can see, although it looks almost identical to the previous case, the timing now ends right after the API call, so each run takes only about 8us. The actual measurement agrees:
[Figure: timing output with the second synchronization removed]
Synchronize only the second time
Let's look at the actually measured times first:
[Figure: timing output with the first synchronization removed]
Strange, isn't it? The first run takes very long. Let's visualize what is going on:
[Figure: timeline with the first synchronization removed]
As you can see, because there is no synchronization before the first timing starts, the first cuda kernel launch is issued right after the API calls of the GPU warm-up complete. The first cuda kernel then only starts executing once the warm-up has finished, followed by the thread synchronization, and the timing does not stop until that synchronization ends. This whole stretch is very long, about 130us. From the second run on, execution is normal again, because the synchronization after each kernel effectively acts as the synchronization before the next run.
No synchronization at all
Let's look at the execution first:
[Figure: timeline with no synchronization]
As you can see, without any synchronization, the API calls of the GPU warm-up and the cuda kernels are all issued back to back, and so are the executions. So the timing only counts each API call, about 7us.
In all four cases, the torch instruction pattern is almost identical, so I won't repeat it.
Summary
After reading this article, you should roughly understand how to implement and call custom CUDA operators in PyTorch, and also know how to correctly measure the running time of CUDA code.
Of course, some topics are left for future articles, such as how to implement custom CUDA operators for the forward and backward passes of a PyTorch neural network, how to call CUDA operators from TensorFlow, and so on.
- END -
I am godweiyang, from the Department of Computer Science at East China Normal University, now an NLP algorithm engineer at ByteDance AI Lab. During campus recruiting I received ssp offers from three major internet companies in Shanghai. My main research directions are machine translation, syntactic parsing, and model compression and acceleration. My biggest traits are a good temper and patience; feel free to consult me anytime, whether about technology or life.
