CUDA Programming 05: Streams and Events
2022-07-27 08:52:00 【DeepDriving】
This article was first published on the WeChat official account 【DeepDriving】. Welcome to follow it.
CUDA streams
In CUDA there are two levels of concurrency: kernel-level concurrency and grid-level concurrency. The previous article introduced kernel-level concurrency, which is achieved by using many GPU threads to execute a single kernel task concurrently. Grid-level concurrency decomposes a task into multiple kernel tasks and runs those kernels concurrently on one device, which raises device utilization. A CUDA stream is a collection of asynchronous operations: operations in the same stream execute on the GPU in strict issue order, while operations in different streams have no such ordering constraint. Launching multiple kernel tasks in multiple streams at the same time is how grid-level concurrency is achieved.
First, let's review the execution flow of a typical CUDA program:

- Copy data from the host to the device;
- Execute the kernel task on the device;
- Copy the results from the device back to the host.
These operations all run in a CUDA stream. If you explicitly create a stream, it is an explicit stream (non-null stream); otherwise operations run in the implicit stream (null stream, also called the default stream). The routines in the previous articles all ran in the implicit stream. If you explicitly create multiple streams to perform the three steps above, different CUDA operations can overlap in time.

Using multiple streams in this way can improve the overall efficiency of a CUDA program. You can declare and create an explicit stream as follows:
cudaStream_t stream;
cudaStreamCreate(&stream);
To destroy a stream, use the following function:
cudaError_t cudaStreamDestroy(cudaStream_t stream);
Because operations in an explicit stream must be asynchronous, and cudaMemcpy is a synchronous operation, you must use its asynchronous version to copy data within an explicit stream:
cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0);
The last parameter of this function specifies a stream identifier; the null stream is used by default. For the data transfer to actually be asynchronous, the host memory must be pinned (page-locked) memory, which guarantees that its physical address in CPU memory will not change during the lifetime of the application. You can allocate pinned host memory with either of the following two functions:
cudaError_t cudaMallocHost(void **ptr, size_t size);
cudaError_t cudaHostAlloc(void **pHost, size_t size, unsigned int flags);
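As a brief sketch (the buffer name and size here are illustrative), pinned host memory is allocated and released like this:

```cpp
// Sketch: allocating pinned (page-locked) host memory for async transfers.
float *h_data = nullptr;
size_t bytes = 1024 * sizeof(float);  // illustrative size

// cudaMallocHost allocates page-locked host memory; unlike malloc,
// it must be released with cudaFreeHost, not free().
cudaMallocHost((void **)&h_data, bytes);

// ... fill h_data and issue cudaMemcpyAsync calls ...

cudaFreeHost(h_data);
```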
When launching a kernel in a non-null stream, the stream identifier must be provided as the 4th parameter of the kernel execution configuration (the 3rd parameter is the size of dynamic shared memory; set it to 0 if none is allocated):
kernel_name<<<grid, block, sharedMemSize, stream>>>(...);
All operations in an explicit stream are asynchronous with respect to the host. The following two functions can be called in host code to check whether all operations in a stream have completed:
cudaError_t cudaStreamSynchronize(cudaStream_t stream);
cudaError_t cudaStreamQuery(cudaStream_t stream);
cudaStreamSynchronize blocks the host until all operations in the specified stream have finished executing. cudaStreamQuery does not block the host: it returns cudaSuccess if all operations in the specified stream have completed, and cudaErrorNotReady otherwise.
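Putting the pieces above together, a minimal sketch of overlapping transfers and kernel execution might look like this (the kernel name MyKernel, the buffer names, and the choice of two streams are illustrative assumptions):

```cpp
// Sketch: splitting one task across two streams so that the data copy of
// one chunk can overlap with the kernel execution of the other chunk.
const int kNumStreams = 2;
const size_t chunk = n / kNumStreams;  // assume n divides evenly

cudaStream_t streams[kNumStreams];
for (int i = 0; i < kNumStreams; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < kNumStreams; ++i) {
    size_t offset = i * chunk;
    // h_in/h_out must be pinned memory for these copies to be asynchronous.
    cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    // MyKernel is a placeholder for any kernel that processes the chunk.
    MyKernel<<<grid, block, 0, streams[i]>>>(d_in + offset, d_out + offset, chunk);
    cudaMemcpyAsync(h_out + offset, d_out + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}

// Wait for all streams to finish, then clean up.
for (int i = 0; i < kNumStreams; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
```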
CUDA events
A CUDA event is a marker at a point in a CUDA stream; it can be used to check whether the operations in the running stream have reached that point. Events serve two basic purposes:

- Synchronizing stream execution;
- Monitoring the progress of the device.
CUDA provides functions to insert events at any point in a stream and to query their completion. An event recorded in a stream completes only after all preceding operations in that stream have completed.
Declare and create an event as follows:
cudaEvent_t event;
cudaError_t cudaEventCreate(cudaEvent_t* event);
Call the following function to destroy an event:
cudaError_t cudaEventDestroy(cudaEvent_t event);
An event can be inserted into a CUDA stream's operation queue with the following function:
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0);
The following function blocks the host until the event has completed:
cudaError_t cudaEventSynchronize(cudaEvent_t event);
Similar to streams, you can also query an event's completion in a non-blocking way:
cudaError_t cudaEventQuery(cudaEvent_t event);
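For example, the host can poll an event while doing other work instead of blocking on it (this loop is an illustrative sketch; `event` and `stream` are assumed to have been created earlier):

```cpp
// Sketch: non-blocking polling of an event recorded in a stream.
cudaEventRecord(event, stream);
while (cudaEventQuery(event) == cudaErrorNotReady) {
    // The GPU has not reached the event yet; the host can do useful
    // work here instead of blocking in cudaEventSynchronize.
}
```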
To measure the time taken by the operations between two events, call:
cudaError_t cudaEventElapsedTime(float* ms, cudaEvent_t start, cudaEvent_t stop);
This function returns the elapsed time between the start and stop events in milliseconds. The start and stop events do not have to be in the same CUDA stream.
You can refer to the following code :
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
VectorAddGPU<<<block_per_grid, thread_per_block>>>(da, db, dc, size);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);
std::cout << "Elapsed time: " << elapsed_time << " ms." << std::endl;
cudaEventDestroy(start);
cudaEventDestroy(stop);
Stream synchronization
CUDA provides two types of host-device synchronization: explicit synchronization and implicit synchronization.
Many functions introduced in previous articles synchronize implicitly. For example, cudaMemcpy blocks the host application until the data transfer completes. Many memory-related operations have implicit synchronization behavior, such as:

- Pinned host memory allocation, e.g. cudaMallocHost;
- Device memory allocation, e.g. cudaMalloc;
- Device memory initialization;
- Memory copies between two addresses on the same device;
- Modification of the L1 cache / shared memory configuration.
CUDA provides several methods of explicit synchronization:

- Use cudaDeviceSynchronize to synchronize the device;
- Use cudaStreamSynchronize to synchronize a stream;
- Use cudaEventSynchronize to synchronize on an event in a stream.
In addition, CUDA provides the following function for cross-stream synchronization using events:
cudaError_t cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, unsigned int flags = 0);
This function makes the specified stream wait for the specified event. The event may have been recorded in the same stream or in a different one; in the latter case, this function performs cross-stream synchronization.
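A minimal sketch of cross-stream synchronization (the kernel names are placeholders): stream2 will not run work issued after the wait until stream1 has passed the recorded event.

```cpp
// Sketch: make stream2 wait for work issued in stream1.
cudaStream_t stream1, stream2;
cudaEvent_t event;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaEventCreate(&event);

ProducerKernel<<<grid, block, 0, stream1>>>(d_data);  // placeholder kernel
cudaEventRecord(event, stream1);                      // mark this point in stream1

// Operations issued into stream2 after this call will not start until
// the event recorded in stream1 has completed.
cudaStreamWaitEvent(stream2, event, 0);
ConsumerKernel<<<grid, block, 0, stream2>>>(d_data);  // placeholder kernel

cudaStreamSynchronize(stream2);
cudaEventDestroy(event);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
```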
Reference material

- 《CUDA C Programming Authoritative Guide》
- 《Professional CUDA C Programming》
- 《CUDA C Programming Guide》
- 《CUDA Programming: A Developer's Guide to Parallel Computing with GPUs》
Welcome to follow my WeChat official account 【DeepDriving】, where I share content on computer vision, machine learning, deep learning, autonomous driving, and related fields from time to time.
