
A Minimalist Introduction to CUDA Programming

2022-06-25 02:50:00 Startling Hong Yibo

Source: "A Minimalist Introduction to CUDA Programming" on Zhihu

Preface

In 2006, NVIDIA released CUDA, a general-purpose parallel computing platform and programming model built on NVIDIA GPUs. With CUDA, programs can exploit the parallel compute engine of the GPU to solve complex computational problems more efficiently. In recent years, one of the most successful applications of GPUs has been deep learning, where GPU-based parallel computing has become the standard configuration for training deep learning models. At the time of writing, the latest CUDA version is CUDA 9.

The GPU is not a standalone computing platform; it has to work together with the CPU and can be regarded as a coprocessor of the CPU. So when we talk about GPU parallel computing, we really mean a heterogeneous computing architecture based on CPU+GPU. In this architecture, the GPU and CPU are connected through the PCIe bus and cooperate: the CPU side is called the host, and the GPU side is called the device, as shown in the figure below.

CPU+GPU heterogeneous computing. Source: Professional CUDA C Programming

As the figure shows, the GPU contains many more compute cores and is especially suited to data-parallel, compute-intensive tasks such as large matrix operations, while the CPU has fewer cores but handles complex control logic well, making it suitable for control-intensive tasks. In addition, CPU threads are heavyweight and context switches are expensive, whereas GPU threads are lightweight thanks to the large number of cores. A CPU+GPU heterogeneous platform therefore lets the two complement each other: the CPU handles serial code with complex logic, while the GPU focuses on data-parallel, compute-intensive code, maximizing overall performance.

Execution logic of a CPU+GPU heterogeneous computing application. Source: Professional CUDA C Programming

CUDA is the GPU programming model developed by NVIDIA. It provides a simple programming interface to the GPU, and CUDA-based programs can build applications that run on the GPU. CUDA supports several programming languages, such as C/C++, Python, and Fortran; here we use the CUDA C/C++ interface to explain CUDA programming. The development platform is Windows 10 + VS 2013; a CUDA installation tutorial for Windows can be found here.

Programming languages supported by the CUDA programming model

Fundamentals of the CUDA programming model

Before giving a CUDA programming example, let's briefly introduce some concepts and basics of the CUDA programming model. The CUDA programming model is a heterogeneous model that requires the CPU and GPU to work together. In CUDA, host and device are two important concepts: host refers to the CPU and its memory, while device refers to the GPU and its memory. A CUDA program contains both host code and device code, which run on the CPU and GPU respectively. The host and device can also communicate with each other, so data can be copied between them.

The typical execution flow of a CUDA program is as follows (a minimal skeleton is sketched after the list):

  1. Allocate host memory and initialize data;
  2. Allocate device memory and copy the data from the host to the device;
  3. Call CUDA kernel functions to perform the computation on the device;
  4. Copy the results from the device back to the host;
  5. Free the memory allocated on the device and the host.
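The following is a minimal sketch of these five steps (the kernel, its name, and the data sizes are hypothetical, and error checking is omitted for brevity):

#include <cstdlib>

// hypothetical kernel: performs some device computation on data
__global__ void kernel(float* data, int n) { /* device computation on data */ }

int main()
{
    int n = 1024;
    size_t nBytes = n * sizeof(float);

    float* h_data = (float*)malloc(nBytes);                      // 1. allocate host memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;                //    and initialize data

    float* d_data;
    cudaMalloc((void**)&d_data, nBytes);                         // 2. allocate device memory
    cudaMemcpy(d_data, h_data, nBytes, cudaMemcpyHostToDevice);  //    and copy host -> device

    kernel<<<(n + 255) / 256, 256>>>(d_data, n);                 // 3. launch the kernel on the device

    cudaMemcpy(h_data, d_data, nBytes, cudaMemcpyDeviceToHost);  // 4. copy the results back to the host

    cudaFree(d_data);                                            // 5. free device memory
    free(h_data);                                                //    and host memory
    return 0;
}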

The most important step in the flow above is calling a CUDA kernel function to perform the parallel computation. The kernel is a central concept in CUDA: it is a function executed in parallel by threads on the device. A kernel is declared with the __global__ qualifier, and when it is called you use the execution configuration <<<grid, block>>> to specify how many threads will execute it. In CUDA, each thread executes the kernel once, and each thread is assigned a unique thread ID, which can be obtained inside the kernel through the built-in variable threadIdx.
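As a sketch, a hypothetical kernel in which each thread adds 1 to one element of a vector looks like this:

// A minimal sketch of a kernel: each thread handles one element (hypothetical example)
__global__ void addOne(float* data, int n)
{
    int i = threadIdx.x;   // this thread's unique ID within its block
    if (i < n)
        data[i] += 1.0f;
}

// Launched from the host with one block of n threads (n must not exceed the block-size limit):
//     addOne<<<1, n>>>(d_data, n);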

Because the GPU programming model is heterogeneous, we need to distinguish code that runs on the host from code that runs on the device. In CUDA this is done with function type qualifiers; the three main qualifiers are as follows (a small example follows the list):

  • __global__: executed on the device, called from the host (some specific GPUs also allow calling it from the device); the return type must be void, variadic arguments are not supported, and it cannot be a class member function. Note that a kernel defined with __global__ is launched asynchronously: the host does not wait for the kernel to finish before moving on to the next step.
  • __device__: executed on the device and callable only from the device; cannot be combined with __global__.
  • __host__: executed on the host and callable only from the host; it is usually omitted. It cannot be combined with __global__, but it can be combined with __device__, in which case the function is compiled for both the device and the host.
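A small sketch of how the three qualifiers are used together (the function names are hypothetical):

// __device__ helper: runs on the device, callable only from device code
__device__ float square(float x)
{
    return x * x;
}

// __host__ __device__: compiled for both the host and the device
__host__ __device__ float triple(float x)
{
    return 3.0f * x;
}

// __global__ kernel: runs on the device, launched from the host
__global__ void transformKernel(float* data)
{
    data[threadIdx.x] = square(data[threadIdx.x]) + triple(1.0f);
}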

To understand kernels deeply, you need a clear picture of the thread hierarchy a kernel uses. The GPU runs a very large number of lightweight parallel threads. When a kernel executes on the device, it launches many threads; all the threads launched by one kernel are called a grid, and threads in the same grid share the same global memory space. The grid is the first level of the thread hierarchy; it is further divided into thread blocks, each containing many threads, which is the second level. This two-level organization is shown in the figure below, where both the grid and the block are 2-dim. Grid and block are both defined as variables of type dim3, which can be viewed as a struct with three unsigned integer members (x, y, z); members not specified default to 1. A grid or block can therefore be flexibly defined as a 1-dim, 2-dim, or 3-dim structure. For the layout in the figure (the horizontal direction is the x axis), the grid and block are defined as shown below, and when the kernel is called, the execution configuration <<<grid, block>>> must be used to specify the number and layout of the threads the kernel uses.

dim3 grid(3, 2);
dim3 block(5, 3);
kernel_fun<<<grid, block>>>(params...);

The two-level thread organization of a kernel (2-dim)

A thread is therefore uniquely identified by two built-in coordinate variables, blockIdx and threadIdx, both of which have x, y, and z components: blockIdx gives the position of the thread's block within the grid, and threadIdx gives the thread's position within its block. For example, Thread (1,1) in the figure satisfies:

threadIdx.x = 1
threadIdx.y = 1
blockIdx.x = 1
blockIdx.y = 1

The threads of a thread block are placed on the same streaming multiprocessor (SM), but a single SM has limited resources, which limits the number of threads per block; modern GPUs support at most 1024 threads per thread block. Sometimes we need a thread's overall ID within its block, which also requires knowing the block's organization; this is obtained through the built-in variable blockDim, which gives the size of each dimension of the thread block. For a 2-dim block (blockDim.x, blockDim.y), the thread with index (x, y) has local ID x + y * blockDim.x; for a 3-dim block (blockDim.x, blockDim.y, blockDim.z), the thread with index (x, y, z) has local ID x + y * blockDim.x + z * blockDim.x * blockDim.y. There is also a built-in variable gridDim, which gives the size of each dimension of the grid.
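As a small sketch of this index arithmetic, a hypothetical kernel for a 2-dim grid and 2-dim block could compute its local and global IDs like this:

#include <cstdio>

// Hypothetical kernel that just prints the index arithmetic for a 2-dim grid/block
__global__ void showIndex()
{
    // local ID of this thread inside its block (2-dim block)
    int localId = threadIdx.x + threadIdx.y * blockDim.x;
    // linear ID of this block inside the grid (2-dim grid)
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    // global ID of this thread across the whole grid
    int globalId = blockId * (blockDim.x * blockDim.y) + localId;
    printf("block %d, local thread %d, global thread %d\n", blockId, localId, globalId);
}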

This thread structure of kernels is a natural fit for vector and matrix operations. For example, we can use the 2-dim structure from the figure above to implement the addition of two matrices, with each thread handling the addition of the two elements at its position; the code is shown below. The thread block size is (16, 16), and the N*N matrix is then split across the required number of thread blocks.

// Kernel Definition 
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) 
{ 
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    int j = blockIdx.y * blockDim.y + threadIdx.y; 
    if (i < N && j < N) 
        C[i][j] = A[i][j] + B[i][j]; 
}
int main() 
{ 
    ...
    // Kernel  Thread configuration 
    dim3 threadsPerBlock(16, 16); 
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    // kernel call 
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C); 
    ...
}

In addition, here is a brief introduction to the CUDA memory model, shown in the figure below. Each thread has its own private local memory (Local Memory). Each thread block has shared memory (Shared Memory) that can be accessed by all threads in the block, with a lifetime matching that of the block. Beyond that, all threads can access global memory (Global Memory), as well as some read-only memory spaces: constant memory (Constant Memory) and texture memory (Texture Memory). The memory hierarchy matters for program optimization, but we will not discuss it in depth here.

CUDA Memory model
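As a small sketch of how these memory spaces appear in code (a hypothetical kernel; the names and sizes are illustrative, and the block is assumed to have at most 256 threads):

__constant__ float coeff[16];                  // constant memory: read-only for kernels, visible to all threads

__global__ void memorySpaces(float* gdata)     // gdata points to global memory
{
    int t = threadIdx.x;                       // ordinary local variable: registers / local memory
    __shared__ float tile[256];                // shared memory: shared by the threads of one block
    tile[t] = gdata[t] * coeff[0];
    __syncthreads();                           // wait until the whole block has written shared memory
    gdata[t] = tile[t];
}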

One more important point: you need a basic understanding of how the GPU hardware executes a kernel. We described the kernel's thread hierarchy above: a kernel launches a very large number of threads that are logically parallel, but they are not necessarily all parallel at the physical level. This is similar to multithreading on a CPU: without multiple cores, true physical parallelism is impossible. Fortunately, a GPU contains a large number of CUDA cores, and making full use of them unleashes the GPU's parallel computing power. A core component of GPU hardware is the SM (Streaming Multiprocessor) mentioned earlier. An SM contains CUDA cores, shared memory, registers, and so on, and can run hundreds of threads concurrently; how many depends on the resources the SM has. When a kernel executes, the thread blocks of its grid are distributed to SMs; a single thread block can only be scheduled on one SM, but one SM can usually schedule multiple thread blocks, depending on its capacity. The thread blocks of one kernel may therefore be spread across multiple SMs: the grid is only a logical layer, while the SM is the physical layer of execution.

An SM uses the SIMT (Single-Instruction, Multiple-Thread) architecture. Its basic execution unit is the warp, a bundle of 32 threads that execute the same instruction at the same time. Each thread keeps its own instruction address counter and register state, so each thread has its own independent execution path. Although the threads of a warp start from the same program address, they can behave differently: when a branch is encountered, some threads may take the branch while others do not and can only wait, because the GPU requires all threads of a warp to execute the same instruction in the same cycle. This warp divergence degrades performance. When a thread block is assigned to an SM, it is further split into warps, since the warp is the SM's basic execution unit, but the number of warps an SM can run concurrently is limited by its resources: the SM must allocate shared memory for each thread block and registers for the threads of each warp, so the SM's configuration limits how many thread blocks and warps it can hold.

In short, the grid and thread blocks are only logical partitions: the threads of a kernel are not necessarily all concurrent at the physical level. As a result, different grid and block configurations of a kernel can lead to different performance, which deserves special attention. Also, because the SM's basic execution unit is the 32-thread warp, the block size is usually set to a multiple of 32.

The logical layer and physical layer of CUDA programming
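As a small sketch of the warp divergence just described (a hypothetical kernel in which the even and odd lanes of a warp take different branches, so the two branches run one after the other):

// Hypothetical kernel showing warp divergence: even and odd lanes of the same warp
// take different branches, so the two branches are serialized
__global__ void divergent(float* data)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;       // runs while the odd lanes of the warp wait
    else
        data[i] += 1.0f;       // runs while the even lanes of the warp wait
}

// A divergence-free variant branches per warp instead, so all 32 threads of a warp
// take the same path:
//     if ((i / warpSize) % 2 == 0) ...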

Before starting CUDA programming, it is useful to check the hardware configuration of your own GPU so that you know what you are targeting. The GPU's properties can be queried with the following code:

    int dev = 0;
    cudaDeviceProp devProp;
    CHECK(cudaGetDeviceProperties(&devProp, dev));
    std::cout << "Using GPU device " << dev << ": " << devProp.name << std::endl;
    std::cout << "Number of SMs: " << devProp.multiProcessorCount << std::endl;
    std::cout << "Shared memory per thread block: " << devProp.sharedMemPerBlock / 1024.0 << " KB" << std::endl;
    std::cout << "Max threads per thread block: " << devProp.maxThreadsPerBlock << std::endl;
    std::cout << "Max threads per SM: " << devProp.maxThreadsPerMultiProcessor << std::endl;
    std::cout << "Max warps per SM: " << devProp.maxThreadsPerMultiProcessor / 32 << std::endl;

    // Output:
    Using GPU device 0: GeForce GT 730
    Number of SMs: 2
    Shared memory per thread block: 48 KB
    Max threads per thread block: 1024
    Max threads per SM: 2048
    Max warps per SM: 64
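The CHECK macro used above is not defined in this article; a common sketch (an assumption, not part of the original code) checks the error code returned by each CUDA runtime call:

#include <cstdio>
#include <cstdlib>

// Hypothetical error-checking macro: verify the cudaError_t returned by a runtime call
#define CHECK(call)                                                     \
do {                                                                    \
    cudaError_t err = (call);                                           \
    if (err != cudaSuccess) {                                           \
        fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",                 \
                cudaGetErrorString(err), __FILE__, __LINE__);           \
        exit(EXIT_FAILURE);                                             \
    }                                                                   \
} while (0)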

Well, the GT 730 really is a rather weak card, with only 2 SMs, sigh...

Vector addition example

Now that we have covered the basics of CUDA programming, let's try a simple hands-on example: using CUDA to implement the addition of two vectors. Before the implementation, let's briefly introduce the memory management APIs used in CUDA programming. The first is the cudaMalloc function, which allocates memory on the device:

cudaError_t cudaMalloc(void** devPtr, size_t size);

This function is similar to malloc in C, except that it allocates the given number of bytes of device memory; devPtr is a pointer to the allocated memory. The allocated memory is released with the cudaFree function, which corresponds to free in C. Another important function, cudaMemcpy, is responsible for data transfers between the host and the device:

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)

Here src points to the source data and dst to the destination, count is the number of bytes to copy, and kind controls the direction of the copy: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice. For example, cudaMemcpyHostToDevice copies data from the host to the device.
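Both functions return a cudaError_t, so in real code the return values should be checked. A small sketch of the allocate/copy/free round trip (h_buf and nBytes are assumed to come from the caller, and CHECK is the hypothetical macro sketched earlier):

// A small sketch: checked allocate / copy / free round trip
void roundTrip(float* h_buf, size_t nBytes)
{
    float* d_buf = nullptr;
    CHECK(cudaMalloc((void**)&d_buf, nBytes));                        // allocate device memory
    CHECK(cudaMemcpy(d_buf, h_buf, nBytes, cudaMemcpyHostToDevice));  // copy host -> device
    // ... launch kernels that read and write d_buf here ...
    CHECK(cudaMemcpy(h_buf, d_buf, nBytes, cudaMemcpyDeviceToHost));  // copy device -> host
    CHECK(cudaFree(d_buf));                                           // free device memory
}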

Now let's implement the vector addition example. Here both the grid and the block are 1-dim. First define the kernel as follows:

// vector addition kernel: grid and block are both 1-dim
__global__ void add(float* x, float * y, float* z, int n)
{
    //  Get global index 
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // stride: total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
    {
        z[i] = x[i] + y[i];
    }
}

Here stride is the total number of threads in the grid. Sometimes a vector has more elements than there are threads; in that case each thread processes multiple elements (total elements / total threads), as if the grid were applied several times over the data. This is the grid-stride loop pattern. In the example below, however, each thread processes only one element, so the loop body in the kernel executes just once per thread. Now let's implement the vector addition:

int main()
{
    int N = 1 << 20;
    int nBytes = N * sizeof(float);
    // allocate host memory
    float *x, *y, *z;
    x = (float*)malloc(nBytes);
    y = (float*)malloc(nBytes);
    z = (float*)malloc(nBytes);

    // initialize data
    for (int i = 0; i < N; ++i)
    {
        x[i] = 10.0;
        y[i] = 20.0;
    }

    // allocate device memory
    float *d_x, *d_y, *d_z;
    cudaMalloc((void**)&d_x, nBytes);
    cudaMalloc((void**)&d_y, nBytes);
    cudaMalloc((void**)&d_z, nBytes);

    // copy the data from host to device
    cudaMemcpy((void*)d_x, (void*)x, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy((void*)d_y, (void*)y, nBytes, cudaMemcpyHostToDevice);
    // define the kernel execution configuration
    dim3 blockSize(256);
    dim3 gridSize((N + blockSize.x - 1) / blockSize.x);
    // launch the kernel
    add<<<gridSize, blockSize>>>(d_x, d_y, d_z, N);

    // copy the results from device back to host
    cudaMemcpy((void*)z, (void*)d_z, nBytes, cudaMemcpyDeviceToHost);

    // check the result
    float maxError = 0.0;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(z[i] - 30.0));
    std::cout << " maximum error : " << maxError << std::endl;

    // free device memory
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
    // free host memory
    free(x);
    free(y);
    free(z);

    return 0;
}

Here the vector size is 1<<20 and the block size is 256, so the grid size is 4096. The kernel's thread hierarchy is shown in the figure below:

Thread hierarchy of the kernel. Source: https://devblogs.nvidia.com/even-easier-introduction-cuda/

The nvprof tool can be used to profile the kernel's execution. The results are shown below: the kernel takes about 1.5 ms.

nvprof cuda9.exe
==7244== NVPROF is profiling process 7244, command: cuda9.exe
 maximum error : 4.31602e+008
==7244== Profiling application: cuda9.exe
==7244== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   67.57%  3.2256ms         2  1.6128ms  1.6017ms  1.6239ms  [CUDA memcpy HtoD]
                   32.43%  1.5478ms         1  1.5478ms  1.5478ms  1.5478ms  add(float*, float*, float*, int)

You can adjust the block size and compare the kernel's performance under different configurations. In my tests, with a block size of 128 the kernel takes 1.6 ms, with 512 it takes 1.7 ms, and with 64 it takes 2.3 ms. So a larger block is not necessarily better; a moderate value should be chosen.
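If you want to time the kernel from inside the program instead of with nvprof, a sketch using CUDA events (a hypothetical helper, not used in the original code) looks like this:

// Hypothetical helper: time one launch of the add kernel with CUDA events
float timeAdd(dim3 gridSize, dim3 blockSize, float* d_x, float* d_y, float* d_z, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    add<<<gridSize, blockSize>>>(d_x, d_y, d_z, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}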

In the implementation above, we had to allocate memory separately on the host and the device and copy data between them, which is error-prone. Fortunately, CUDA 6.0 introduced Unified Memory to avoid this hassle. Simply put, Unified Memory uses managed memory to serve both the host and the device, transferring data between them automatically. Managed memory is allocated with the cudaMallocManaged function:

cudaError_t cudaMallocManaged(void **devPtr, size_t size, unsigned int flag=0);

With Unified Memory, the program above can be simplified as follows:

int main()
{
    int N = 1 << 20;
    int nBytes = N * sizeof(float);

    // allocate managed memory
    float *x, *y, *z;
    cudaMallocManaged((void**)&x, nBytes);
    cudaMallocManaged((void**)&y, nBytes);
    cudaMallocManaged((void**)&z, nBytes);

    // initialize data
    for (int i = 0; i < N; ++i)
    {
        x[i] = 10.0;
        y[i] = 20.0;
    }

    // define the kernel execution configuration
    dim3 blockSize(256);
    dim3 gridSize((N + blockSize.x - 1) / blockSize.x);
    // launch the kernel
    add<<<gridSize, blockSize>>>(x, y, z, N);

    // synchronize with the device to make sure the results can be accessed correctly
    cudaDeviceSynchronize();
    // check the result
    float maxError = 0.0;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(z[i] - 30.0));
    std::cout << " maximum error : " << maxError << std::endl;

    // free memory
    cudaFree(x);
    cudaFree(y);
    cudaFree(z);

    return 0;
}

Compared with the previous code, the Unified Memory version is simpler. Note that the kernel launch is asynchronous with respect to the host, and because managed memory transfers data automatically, we must call cudaDeviceSynchronize() to make sure the device and host are synchronized before the host accesses the kernel's results.
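Unified Memory relies on on-demand page migration, which is one reason reference 5 discusses its performance. On GPUs of compute capability 6.0 or higher (not the GT 730 used here), the managed buffers can be prefetched with cudaMemPrefetchAsync before the launch; a sketch under that assumption, replacing the launch section of the program above:

// A sketch, assuming a GPU of compute capability 6.0 or higher:
// prefetch the managed buffers to the device before the launch,
// and the result back to the host afterwards
int device = 0;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, nBytes, device);
cudaMemPrefetchAsync(y, nBytes, device);
cudaMemPrefetchAsync(z, nBytes, device);
add<<<gridSize, blockSize>>>(x, y, z, N);
cudaMemPrefetchAsync(z, nBytes, cudaCpuDeviceId);  // migrate the result back to the host
cudaDeviceSynchronize();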

Matrix multiplication example

Finally, let's implement a slightly more complex example: the multiplication of two matrices. Let the input matrices be A and B; we want to compute C = A × B. The idea is that each thread computes one element of C, so for this matrix operation the grid and block should be 2-dim. First define a matrix struct:

// matrix type, row-major: M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
    int width;
    int height;
    float *elements;
};

How the matrix multiplication is implemented (each thread computes one element of C)

Next, implement the matrix multiplication kernel. Two __device__ helper functions are defined, one to read an element of a matrix and one to write an element; the code is as follows:

// get element (row, col) of matrix A
__device__ float getElement(Matrix *A, int row, int col)
{
	return A->elements[row * A->width + col];
}

// set element (row, col) of matrix A
__device__ void setElement(Matrix *A, int row, int col, float value)
{
	A->elements[row * A->width + col] = value;
}

// matrix multiplication kernel, 2-dim: each thread computes one element of C
__global__ void matMulKernel(Matrix *A, Matrix *B, Matrix *C)
{
	float Cvalue = 0.0;
	int row = threadIdx.y + blockIdx.y * blockDim.y;
	int col = threadIdx.x + blockIdx.x * blockDim.x;
	for (int i = 0; i < A->width; ++i)
	{
		Cvalue += getElement(A, row, i) * getElement(B, i, col);
	}
	setElement(C, row, col, Cvalue);
}

Finally, let's write a test program for the matrix multiplication using Unified Memory:

int main()
{
    int width = 1 << 10;
    int height = 1 << 10;
    Matrix *A, *B, *C;
    // allocate managed memory
    cudaMallocManaged((void**)&A, sizeof(Matrix));
    cudaMallocManaged((void**)&B, sizeof(Matrix));
    cudaMallocManaged((void**)&C, sizeof(Matrix));
    int nBytes = width * height * sizeof(float);
    cudaMallocManaged((void**)&A->elements, nBytes);
    cudaMallocManaged((void**)&B->elements, nBytes);
    cudaMallocManaged((void**)&C->elements, nBytes);

    // initialize data
    A->height = height;
    A->width = width;
    B->height = height;
    B->width = width;
    C->height = height;
    C->width = width;
    for (int i = 0; i < width * height; ++i)
    {
        A->elements[i] = 1.0;
        B->elements[i] = 2.0;
    }

    // define the kernel execution configuration
    dim3 blockSize(32, 32);
    dim3 gridSize((width + blockSize.x - 1) / blockSize.x, 
        (height + blockSize.y - 1) / blockSize.y);
    // launch the kernel
    matMulKernel<<<gridSize, blockSize>>>(A, B, C);


    // synchronize with the device to make sure the results can be accessed correctly
    cudaDeviceSynchronize();
    // check the result
    float maxError = 0.0;
    for (int i = 0; i < width * height; ++i)
        maxError = fmax(maxError, fabs(C->elements[i] - 2 * width));
    std::cout << " maximum error : " << maxError << std::endl;

    return 0;
}

Here the matrices are 1024 × 1024, the thread block size is (32, 32), so the grid size is (32, 32). The final test results are as follows:

nvprof cuda9.exe
==16304== NVPROF is profiling process 16304, command: cuda9.exe
 maximum error : 0
==16304== Profiling application: cuda9.exe
==16304== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.32752s         1  1.32752s  1.32752s  1.32752s  matMulKernel(Matrix*, Matrix*, Matrix*)
      API calls:   83.11%  1.32762s         1  1.32762s  1.32762s  1.32762s  cudaDeviceSynchronize
                   13.99%  223.40ms         6  37.233ms  37.341us  217.66ms  cudaMallocManaged
                    2.81%  44.810ms         1  44.810ms  44.810ms  44.810ms  cudaLaunch
                    0.08%  1.3300ms        94  14.149us       0ns  884.64us  cuDeviceGetAttribute
                    0.01%  199.03us         1  199.03us  199.03us  199.03us  cuDeviceGetName
                    0.00%  10.009us         1  10.009us  10.009us  10.009us  cuDeviceTotalMem
                    0.00%  6.5440us         1  6.5440us  6.5440us  6.5440us  cudaConfigureCall
                    0.00%  3.0800us         3  1.0260us     385ns  1.5400us  cudaSetupArgument
                    0.00%  2.6940us         3     898ns     385ns  1.5390us  cuDeviceGetCount
                    0.00%  1.9250us         2     962ns     385ns  1.5400us  cuDeviceGet

==16304== Unified Memory profiling result:
Device "GeForce GT 730 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    2051  4.0000KB  4.0000KB  4.0000KB  8.011719MB  21.20721ms  Host To Device
     270  45.570KB  4.0000KB  1.0000MB  12.01563MB  7.032508ms  Device To Host

Of course, this is not the most efficient implementation; further optimizations are left for later...

Summary

Finally, just one sentence: CUDA is easy to get started with, but hard to master! Hopefully this won't be a case of going from getting started to giving up...

Reference material

  1. John Cheng, Max Grossman, Ty McKercher. Professional CUDA C Programming, 2014.
  2. CUDA docs.
  3. An Even Easier Introduction to CUDA.
  4. Unified Memory in CUDA 6.
  5. Maximizing Unified Memory Performance in CUDA.

Copyright notice
This article was written by [Startling Hong Yibo]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/176/202206242342485946.html