PyTorch custom CUDA operator tutorial and running time analysis
Recently, for work, I've been learning a lot of CUDA. Here I briefly record how to write a custom CUDA operator for PyTorch, walk through a very simple example, and introduce the correct way to analyze CUDA running time in PyTorch.
All the code is on GitHub: https://github.com/godweiyang/torch-cuda-example
Complete workflow
Let's take a closer look at how PyTorch calls a custom CUDA operator.
First, there are four code files:

- main.py: the Python entry point, where you'd normally write your model.
- add2.cpp: the bridge between torch and CUDA, wrapping the CUDA code into a library that Python can call.
- add2.h: the CUDA function declarations.
- add2.cu: the CUDA function implementations.
Now let's go through the files one by one to see how the call chain works.
CUDA operator implementation
First, the simplest part: add2.h and add2.cu, which are just an ordinary CUDA implementation.

The function implemented here adds two tensors of length n. Each block has 1024 threads, so there are n/1024 blocks in total (rounded up). We won't go into the CUDA details here; they're not the focus of this article.

add2_kernel is the kernel function, which runs on the GPU. launch_add2 is the host-side function that runs on the CPU and launches the kernel. Note that the launch is asynchronous: control returns to the CPU immediately after the call, so be careful when timing later on, or you may easily measure only the launch itself.
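For reference, here is a minimal sketch of what add2.h and add2.cu might look like (modeled on the repo linked above; the grid-stride loop is one reasonable choice, not the only one):

```cuda
// add2.h -- declaration visible to the C++ wrapper
void launch_add2(float* c, const float* a, const float* b, int n);

// add2.cu -- kernel plus its host-side launcher
__global__ void add2_kernel(float* c, const float* a, const float* b, int n) {
    // grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th element
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

void launch_add2(float* c, const float* a, const float* b, int n) {
    dim3 block(1024);              // 1024 threads per block
    dim3 grid((n + 1023) / 1024);  // ceil(n / 1024) blocks
    add2_kernel<<<grid, block>>>(c, a, b, n);  // asynchronous: returns to the CPU immediately
}
```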
Torch C++ wrapper
The relevant file here is add2.cpp, whose main job is to provide an interface that PyTorch can call.

The torch_launch_add2 function takes the C++ versions of the torch tensors, converts them into raw C++ pointers, and calls the CUDA function launch_add2 to run the kernel.

pybind11 is used to bind torch_launch_add2; compiling with cmake would then produce a .so library callable from Python. We won't compile manually with cmake, though; the specific method is covered in the next section.
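A minimal sketch of add2.cpp, assuming contiguous float32 CUDA tensors (error handling omitted):

```cpp
#include <torch/extension.h>
#include "add2.h"

// unpack the torch tensors into raw device pointers and hand them to CUDA
void torch_launch_add2(torch::Tensor &c,
                       const torch::Tensor &a,
                       const torch::Tensor &b,
                       int64_t n) {
    launch_add2((float *)c.data_ptr(),
                (const float *)a.data_ptr(),
                (const float *)b.data_ptr(),
                n);
}

// expose the function to Python; TORCH_EXTENSION_NAME is filled in at build time
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("torch_launch_add2",
          &torch_launch_add2,
          "add2 kernel wrapper");
}
```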
Python call
Finally, the Python level, where we as users write the code that calls the library generated above.

torch.utils.cpp_extension.load is used to automatically compile the cpp and cu files above. The key argument is sources, which specifies the list of files to compile. After that, we can call the wrapped interface through cuda_module.torch_launch_add2.
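A minimal sketch of this flow (the tensor sizes here are made up for illustration):

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile add2.cpp / add2.cu into an importable extension module
cuda_module = load(name="add2",
                   sources=["add2.cpp", "add2.cu"],
                   verbose=True)

n = 1024 * 1024
a = torch.rand(n, device="cuda:0")
b = torch.rand(n, device="cuda:0")
c = torch.empty(n, device="cuda:0")

# call the wrapped CUDA operator: c = a + b
cuda_module.torch_launch_add2(c, a, b, n)
```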
After that, the code can do whatever it likes. Here it simply measures running time and compares the speed against native torch; that part is saved for the next section.
To sum up, there are three main parts:

- First, write the CUDA operator and its corresponding host-side launch function.
- Then, write a torch cpp function that connects PyTorch and CUDA, wrapped with pybind11.
- Finally, compile and call it with PyTorch's cpp extension library.
Running time analysis
As we know, CUDA kernel launches are asynchronous, so we can't just wrap the CUDA call with time.time(); that would only measure the time of the CUDA API call, not the time the GPU actually spends running.

So we need to add a synchronization call that waits for all of the kernel's threads to finish before the CPU executes any subsequent instructions. Here we put the synchronization on the Python side, using the torch.cuda.synchronize function.

Concretely, the code is shaped like this:
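Here is a minimal sketch of the timing helper (the name show_time is an assumption; the original code is in main.py in the repo):

```python
import time
import torch

def show_time(func, ntest=10):
    times = []
    # warm up so the GPU reaches a steady state before timing
    for _ in range(10):
        func()
    for _ in range(ntest):
        torch.cuda.synchronize()  # first sync: drain earlier GPU work
        start = time.time()
        func()
        torch.cuda.synchronize()  # second sync: wait for func()'s kernel to finish
        times.append((time.time() - start) * 1e6)  # record in microseconds
    return sum(times) / len(times)
```

Here func() would be a small closure calling, for example, cuda_module.torch_launch_add2(c, a, b, n).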
The first synchronization prevents any unsynchronized GPU instructions issued by earlier code from still running; the second waits until all of func()'s threads have finished before recording the time.

We time torch and cuda separately over 10 runs each and look at the average. In addition, 10 warm-up runs are executed first to bring the GPU to a steady state.
We test four cases:

- Both syncs
- First sync only
- Second sync only
- No syncs
We use NVIDIA Nsight Systems to visualize which instructions are executing at each moment of the run.
First, install Nsight Systems.
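As an example, on Ubuntu with NVIDIA's apt repository configured, something like the following should work (an assumption — check NVIDIA's download page for the package that matches your platform):

```shell
sudo apt update
sudo apt install nsight-systems
```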
Then run the Python code as usual, simply prefixing the command with nsys profile.
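For example, assuming the entry script is main.py:

```shell
nsys profile python3 main.py
```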
This produces two files, report1.qdstrm and report1.sqlite. Next, convert report1.qdstrm into a report1.qdrep file.
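The conversion can be done with the QdstrmImporter tool that ships with Nsight Systems (its install path varies; this is a sketch):

```shell
QdstrmImporter -i report1.qdstrm
```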
Finally, open the resulting report1.qdrep file in the Nsight Systems GUI; I'm doing that on a Mac here.
Both syncs
This is the correct way to measure time. Open Nsight Systems and zoom in on the section where the kernels run:

[Figure: Nsight Systems timeline; boxes 1 and 3 are the warm-ups, boxes 2 and 4 are the timed additions]
Boxes 1 and 3 are the GPU warm-up phases of cuda and torch respectively; this part has no thread synchronization (the yellow blocks in the top row).

Boxes 2 and 4 are the addition runs of cuda and torch; let's zoom in for a closer look:

[Figure: zoomed-in view of a single timed run]
We can see that each run (one box) goes through three steps: the API call (blue box in the upper left), the kernel execution (blue box below), and finally the thread synchronization (yellow box in the upper right).

So the measured time covers all three steps, i.e. the selected range in the figure below:

[Figure: selected range spanning API call, kernel execution, and synchronization]
The duration is about 29us, which is fairly close to what we measured in the actual code:

[Figure: timing output printed by the script]
Strictly speaking, the time we really want doesn't include the API call and the thread synchronization, but that overhead is hard to strip out on the Python side, so it's included.
First sync only
Zooming in on a single run:

[Figure: zoomed-in view of one run with only the first sync]
Although this looks almost identical to the previous case, the timer now stops right after the API call, so the measured time is only about 8us. The actual measurements show the same thing:

[Figure: timing output printed by the script]
Second sync only
Let's first look at the measured times:

[Figure: timing output; the first run is far slower than the rest]
Strange, isn't it? The first run takes far longer than the others. Let's visualize what's going on:

[Figure: timeline showing the first timed run overlapping the warm-up work]
We can see that, because there is no synchronization before the first timing starts, the first cuda kernel's API call is issued as soon as the warm-up API calls return. The kernel itself, however, only starts executing once all the warm-up work has finished; then comes the thread synchronization, and only after that does the timer stop. This whole stretch is very long, roughly 130us. From the second run onward everything looks normal, because the synchronization at the end of one kernel effectively serves as the synchronization before the next run.
No syncs
Let's first look at the execution:

[Figure: timeline with all API calls and kernels packed back to back]
With no synchronization at all, the API calls for the GPU warm-up and the cuda kernels are issued back to back, and the kernels execute back to back as well. The timer therefore only captures each API call, roughly 7us.
In all four cases, the torch instruction traces look much the same, so I won't repeat them.
Summary
From this article you should get a rough idea of how to implement and call a custom CUDA operator in PyTorch, as well as how to measure CUDA running time correctly.
Of course, some topics are left for future posts, such as implementing custom CUDA operators for the forward and backward passes of a PyTorch neural network, or calling CUDA operators from TensorFlow.
- END -
I'm godweiyang, from the computer science department of East China Normal University, now an NLP algorithm engineer at ByteDance AI Lab. During autumn campus recruiting I received ssp offers from three major internet companies in Shanghai. My main research interests are machine translation, syntactic parsing, and model compression and acceleration. My biggest traits are a good temper and patience; feel free to ask me anything, technical or otherwise.
