PyTorch custom CUDA operator tutorial and running time analysis
Recently, for work, I've been learning a lot of CUDA. Here I briefly record how to write a custom CUDA operator for PyTorch, walk through a very simple example, and introduce the correct way to analyze CUDA running time in PyTorch.
All the code is on GitHub: https://github.com/godweiyang/torch-cuda-example
Complete workflow
Let's take a closer look at how PyTorch calls a custom CUDA operator.
First, there are four code files:

- main.py: the Python entry point, where you'd normally write your model.
- add2.cpp: the bridge between torch and CUDA, wrapping the CUDA code into a library that Python can call.
- add2.h: the CUDA function declarations.
- add2.cu: the CUDA function implementations.
Now let's go through the files one by one to see how the call chain works.
CUDA operator implementation
First, the simplest part: add2.h and add2.cu, which are just an ordinary CUDA implementation.

The function implemented here adds two tensors of length n. Each block has 1024 threads, so there are n/1024 blocks in total (rounded up). We won't go into the CUDA details here; they're not the focus of this article.

add2_kernel is the kernel function, which runs on the GPU. launch_add2 is the host-side function that runs on the CPU and launches the kernel. Note that the launch is asynchronous: control returns to the CPU immediately after the call, so be careful when timing later on, or you may easily measure only the launch itself.
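For reference, here is a minimal sketch of what add2.h and add2.cu might look like (modeled on the repo linked above; the grid-stride loop is one reasonable choice, not the only one):

```cuda
// add2.h -- declaration visible to the C++ wrapper
void launch_add2(float* c, const float* a, const float* b, int n);

// add2.cu -- kernel plus its host-side launcher
__global__ void add2_kernel(float* c, const float* a, const float* b, int n) {
    // grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th element
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

void launch_add2(float* c, const float* a, const float* b, int n) {
    dim3 block(1024);              // 1024 threads per block
    dim3 grid((n + 1023) / 1024);  // ceil(n / 1024) blocks
    add2_kernel<<<grid, block>>>(c, a, b, n);  // asynchronous: returns to the CPU immediately
}
```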
Torch C++ wrapper
The relevant file here is add2.cpp, whose main job is to provide an interface that PyTorch can call.

The torch_launch_add2 function takes the C++ versions of the torch tensors, converts them into raw C++ pointers, and calls the CUDA function launch_add2 to run the kernel.

pybind11 is used to bind torch_launch_add2; compiling with cmake would then produce a .so library callable from Python. We won't compile manually with cmake, though; the specific method is covered in the next section.
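A minimal sketch of add2.cpp, assuming contiguous float32 CUDA tensors (error handling omitted):

```cpp
#include <torch/extension.h>
#include "add2.h"

// unpack the torch tensors into raw device pointers and hand them to CUDA
void torch_launch_add2(torch::Tensor &c,
                       const torch::Tensor &a,
                       const torch::Tensor &b,
                       int64_t n) {
    launch_add2((float *)c.data_ptr(),
                (const float *)a.data_ptr(),
                (const float *)b.data_ptr(),
                n);
}

// expose the function to Python; TORCH_EXTENSION_NAME is filled in at build time
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("torch_launch_add2",
          &torch_launch_add2,
          "add2 kernel wrapper");
}
```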
Python call
Finally, the Python level, where we as users write the code that calls the library generated above.

torch.utils.cpp_extension.load is used to automatically compile the cpp and cu files above. The key argument is sources, which specifies the list of files to compile. After that, we can call the wrapped interface through cuda_module.torch_launch_add2.
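A minimal sketch of this flow (the tensor sizes here are made up for illustration):

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile add2.cpp / add2.cu into an importable extension module
cuda_module = load(name="add2",
                   sources=["add2.cpp", "add2.cu"],
                   verbose=True)

n = 1024 * 1024
a = torch.rand(n, device="cuda:0")
b = torch.rand(n, device="cuda:0")
c = torch.empty(n, device="cuda:0")

# call the wrapped CUDA operator: c = a + b
cuda_module.torch_launch_add2(c, a, b, n)
```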
After that, the code can do whatever it likes. Here it simply measures running time and compares the speed against native torch; that part is saved for the next section.
To sum up, there are three main parts:

- First, write the CUDA operator and its corresponding host-side launch function.
- Then, write a torch cpp function that connects PyTorch and CUDA, wrapped with pybind11.
- Finally, compile and call it with PyTorch's cpp extension library.
Running time analysis
As we know, CUDA kernel launches are asynchronous, so we can't just wrap the CUDA call with time.time(); that would only measure the time of the CUDA API call, not the time the GPU actually spends running.

So we need to add a synchronization call that waits for all of the kernel's threads to finish before the CPU executes any subsequent instructions. Here we put the synchronization on the Python side, using the torch.cuda.synchronize function.

Concretely, the code is shaped like this:
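Here is a minimal sketch of the timing helper (the name show_time is an assumption; the original code is in main.py in the repo):

```python
import time
import torch

def show_time(func, ntest=10):
    times = []
    # warm up so the GPU reaches a steady state before timing
    for _ in range(10):
        func()
    for _ in range(ntest):
        torch.cuda.synchronize()  # first sync: drain earlier GPU work
        start = time.time()
        func()
        torch.cuda.synchronize()  # second sync: wait for func()'s kernel to finish
        times.append((time.time() - start) * 1e6)  # record in microseconds
    return sum(times) / len(times)
```

Here func() would be a small closure calling, for example, cuda_module.torch_launch_add2(c, a, b, n).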
The first synchronization prevents any unsynchronized GPU instructions issued by earlier code from still running; the second waits until all of func()'s threads have finished before recording the time.

We time torch and cuda separately over 10 runs each and look at the average. In addition, 10 warm-up runs are executed first to bring the GPU to a steady state.
We test four cases:

- Both syncs
- First sync only
- Second sync only
- No syncs
We use NVIDIA Nsight Systems to visualize which instructions are executing at each moment of the run.
First, install Nsight Systems.
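As an example, on Ubuntu with NVIDIA's apt repository configured, something like the following should work (an assumption — check NVIDIA's download page for the package that matches your platform):

```shell
sudo apt update
sudo apt install nsight-systems
```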
Then run the Python code as usual, simply prefixing the command with nsys profile.
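For example, assuming the entry script is main.py:

```shell
nsys profile python3 main.py
```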
This produces two files, report1.qdstrm and report1.sqlite. Next, convert report1.qdstrm into a report1.qdrep file.
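The conversion can be done with the QdstrmImporter tool that ships with Nsight Systems (its install path varies; this is a sketch):

```shell
QdstrmImporter -i report1.qdstrm
```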
Finally, open the resulting report1.qdrep file in the Nsight Systems GUI; I'm doing that on a Mac here.
Both syncs
This is the correct way to measure time. Open Nsight Systems and zoom in on the section where the kernels run:

[Figure: Nsight Systems timeline; boxes 1 and 3 are the warm-ups, boxes 2 and 4 are the timed additions]
Boxes 1 and 3 are the GPU warm-up phases of cuda and torch respectively; this part has no thread synchronization (the yellow blocks in the top row).

Boxes 2 and 4 are the addition runs of cuda and torch; let's zoom in for a closer look:

[Figure: zoomed-in view of a single timed run]
We can see that each run (one box) goes through three steps: the API call (blue box in the upper left), the kernel execution (blue box below), and finally the thread synchronization (yellow box in the upper right).

So the measured time covers all three steps, i.e. the selected range in the figure below:

[Figure: selected range spanning API call, kernel execution, and synchronization]
The duration is about 29us, which is fairly close to what we measured in the actual code:

[Figure: timing output printed by the script]
Strictly speaking, the time we really want doesn't include the API call and the thread synchronization, but that overhead is hard to strip out on the Python side, so it's included.
First sync only
Zooming in on a single run:

[Figure: zoomed-in view of one run with only the first sync]
Although this looks almost identical to the previous case, the timer now stops right after the API call, so the measured time is only about 8us. The actual measurements show the same thing:

[Figure: timing output printed by the script]
Second sync only
Let's first look at the measured times:

[Figure: timing output; the first run is far slower than the rest]
Strange, isn't it? The first run takes far longer than the others. Let's visualize what's going on:

[Figure: timeline showing the first timed run overlapping the warm-up work]
We can see that, because there is no synchronization before the first timing starts, the first cuda kernel's API call is issued as soon as the warm-up API calls return. The kernel itself, however, only starts executing once all the warm-up work has finished; then comes the thread synchronization, and only after that does the timer stop. This whole stretch is very long, roughly 130us. From the second run onward everything looks normal, because the synchronization at the end of one kernel effectively serves as the synchronization before the next run.
No syncs
Let's first look at the execution:

[Figure: timeline with all API calls and kernels packed back to back]
With no synchronization at all, the API calls for the GPU warm-up and the cuda kernels are issued back to back, and the kernels execute back to back as well. The timer therefore only captures each API call, roughly 7us.
In all four cases, the torch instruction traces look much the same, so I won't repeat them.
Summary
From this article you should get a rough idea of how to implement and call a custom CUDA operator in PyTorch, as well as how to measure CUDA running time correctly.
Of course, some topics are left for future posts, such as implementing custom CUDA operators for the forward and backward passes of a PyTorch neural network, or calling CUDA operators from TensorFlow.
- END -
I'm godweiyang, from the computer science department of East China Normal University, now an NLP algorithm engineer at ByteDance AI Lab. During autumn campus recruiting I received ssp offers from three major internet companies in Shanghai. My main research interests are machine translation, syntactic parsing, and model compression and acceleration. My biggest traits are a good temper and patience; feel free to ask me anything, technical or otherwise.
