CUDA Programming 05: Streams and Events
2022-07-27 08:52:00 【DeepDriving】
This article was first published on the WeChat official account 【DeepDriving】. Welcome to follow it.
CUDA streams
In CUDA there are two levels of concurrency: kernel-level concurrency and grid-level concurrency. The previous article introduced kernel-level concurrency, which is achieved by using many GPU threads to execute a single kernel task concurrently. Grid-level concurrency decomposes a task into multiple kernel tasks and runs those kernels concurrently on one device, which raises device utilization. A CUDA stream is a collection of asynchronous operations: operations in the same stream execute on the GPU in strict issue order, while operations in different streams have no such ordering constraint. Launching multiple kernel tasks in multiple streams at the same time is how grid-level concurrency is achieved.
First, let's review the execution flow of a typical CUDA program:

- Copy data from the host to the device;
- Execute the kernel task on the device;
- Copy the results from the device back to the host.
These operations all run in a CUDA stream. If you explicitly create a stream, it is an explicit stream (non-null stream); otherwise operations run in the implicit stream (null stream, also called the default stream). The routines in the previous articles all ran in the implicit stream. If you explicitly create multiple streams to perform the three steps above, different CUDA operations can overlap in time.

Using multiple streams in this way can improve the overall efficiency of a CUDA program. You can declare and create an explicit stream as follows:
cudaStream_t stream;
cudaStreamCreate(&stream);
To destroy a stream, use the following function:
cudaError_t cudaStreamDestroy(cudaStream_t stream);
Because operations in an explicit stream must be asynchronous, and cudaMemcpy is a synchronous operation, you must use its asynchronous version to copy data within an explicit stream:
cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0);
The last parameter of this function specifies a stream identifier; the null stream is used by default. For the data transfer to actually be asynchronous, the host memory must be pinned (page-locked) memory, which guarantees that its physical address in CPU memory will not change during the lifetime of the application. You can allocate pinned host memory with either of the following two functions:
cudaError_t cudaMallocHost(void **ptr, size_t size);
cudaError_t cudaHostAlloc(void **pHost, size_t size, unsigned int flags);
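As a brief sketch (the buffer name and size here are illustrative), pinned host memory is allocated and released like this:

```cpp
// Sketch: allocating pinned (page-locked) host memory for async transfers.
float *h_data = nullptr;
size_t bytes = 1024 * sizeof(float);  // illustrative size

// cudaMallocHost allocates page-locked host memory; unlike malloc,
// it must be released with cudaFreeHost, not free().
cudaMallocHost((void **)&h_data, bytes);

// ... fill h_data and issue cudaMemcpyAsync calls ...

cudaFreeHost(h_data);
```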
When launching a kernel in a non-null stream, the stream identifier must be provided as the 4th parameter of the kernel execution configuration (the 3rd parameter is the size of dynamic shared memory; set it to 0 if none is allocated):
kernel_name<<<grid, block, sharedMemSize, stream>>>(...);
All operations in an explicit stream are asynchronous with respect to the host. The following two functions can be called in host code to check whether all operations in a stream have completed:
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
cudaError_t cudaStreamQuery(cudaStream_t stream);
cudaStreamSynchronize blocks the host until all operations in the specified stream have finished executing. cudaStreamQuery does not block the host: it returns cudaSuccess if all operations in the specified stream have completed, and cudaErrorNotReady otherwise.
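Putting the pieces above together, a minimal sketch of overlapping transfers and kernel execution might look like this (the kernel name MyKernel, the buffer names, and the choice of two streams are illustrative assumptions):

```cpp
// Sketch: splitting one task across two streams so that the data copy of
// one chunk can overlap with the kernel execution of the other chunk.
const int kNumStreams = 2;
const size_t chunk = n / kNumStreams;  // assume n divides evenly

cudaStream_t streams[kNumStreams];
for (int i = 0; i < kNumStreams; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < kNumStreams; ++i) {
    size_t offset = i * chunk;
    // h_in/h_out must be pinned memory for these copies to be asynchronous.
    cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    // MyKernel is a placeholder for any kernel that processes the chunk.
    MyKernel<<<grid, block, 0, streams[i]>>>(d_in + offset, d_out + offset, chunk);
    cudaMemcpyAsync(h_out + offset, d_out + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for all streams to finish, then clean up.
for (int i = 0; i < kNumStreams; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
```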
CUDA events
A CUDA event is a marker at a point in a CUDA stream; it can be used to check whether the operations in the running stream have reached that point. Events serve two basic purposes:

- Synchronizing stream execution;
- Monitoring the progress of the device.
CUDA provides functions to insert events at any point in a stream and to query their completion. An event recorded in a stream completes only after all preceding operations in that stream have completed.
Declare and create an event as follows:
cudaEvent_t event;
cudaError_t cudaEventCreate(cudaEvent_t* event);
Call the following function to destroy an event:
cudaError_t cudaEventDestroy(cudaEvent_t event);
An event can be inserted into a CUDA stream's operation queue with the following function:
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0);
The following function blocks the host until the event has completed:
cudaError_t cudaEventSynchronize(cudaEvent_t event);
Similar to streams, you can also query an event's completion in a non-blocking way:
cudaError_t cudaEventQuery(cudaEvent_t event);
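For example, the host can poll an event while doing other work instead of blocking on it (this loop is an illustrative sketch; `event` and `stream` are assumed to have been created earlier):

```cpp
// Sketch: non-blocking polling of an event recorded in a stream.
cudaEventRecord(event, stream);
while (cudaEventQuery(event) == cudaErrorNotReady) {
    // The GPU has not reached the event yet; the host can do useful
    // work here instead of blocking in cudaEventSynchronize.
}
```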
To measure the time taken by the operations between two events, call:
cudaError_t cudaEventElapsedTime(float* ms, cudaEvent_t start, cudaEvent_t stop);
This function returns the elapsed time between the start and stop events in milliseconds. The start and stop events do not have to be in the same CUDA stream.
You can refer to the following code :
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
VectorAddGPU<<<block_per_grid, thread_per_block>>>(da, db, dc, size);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);
std::cout << "Elapsed time: " << elapsed_time << " ms." << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
Stream synchronization
CUDA provides two types of host-device synchronization: explicit synchronization and implicit synchronization.
Many functions introduced in previous articles synchronize implicitly. For example, cudaMemcpy blocks the host application until the data transfer completes. Many memory-related operations have implicit synchronization behavior, such as:

- Pinned host memory allocation, e.g. cudaMallocHost;
- Device memory allocation, e.g. cudaMalloc;
- Device memory initialization;
- Memory copies between two addresses on the same device;
- Modification of the L1 cache / shared memory configuration.
CUDA provides several methods of explicit synchronization:

- Use cudaDeviceSynchronize to synchronize the device;
- Use cudaStreamSynchronize to synchronize a stream;
- Use cudaEventSynchronize to synchronize on an event in a stream.
In addition, CUDA provides the following function for cross-stream synchronization using events:
cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, unsigned int flags = 0);
This function makes the specified stream wait for the specified event. The event may have been recorded in the same stream or in a different one; in the latter case, this function performs cross-stream synchronization.
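A minimal sketch of cross-stream synchronization (the kernel names are placeholders): stream2 will not run work issued after the wait until stream1 has passed the recorded event.

```cpp
// Sketch: make stream2 wait for work issued in stream1.
cudaStream_t stream1, stream2;
cudaEvent_t event;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaEventCreate(&event);

ProducerKernel<<<grid, block, 0, stream1>>>(d_data);  // placeholder kernel
cudaEventRecord(event, stream1);                      // mark this point in stream1

// Operations issued into stream2 after this call will not start until
// the event recorded in stream1 has completed.
cudaStreamWaitEvent(stream2, event, 0);
ConsumerKernel<<<grid, block, 0, stream2>>>(d_data);  // placeholder kernel

cudaStreamSynchronize(stream2);
cudaEventDestroy(event);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```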
Reference material

- 《CUDA C Programming Authoritative Guide》
- 《Professional CUDA C Programming》
- 《CUDA C Programming Guide》
- 《CUDA Programming: A Developer's Guide to Parallel Computing with GPUs》
Welcome to follow my WeChat official account 【DeepDriving】, where I share content on computer vision, machine learning, deep learning, autonomous driving, and related fields from time to time.
