Thread index calculation
You only need to know the starting index of each parallel thread and how to determine the increment. We want every parallel thread to start at a different index, so we linearize the thread index and the thread-block index. The actual index of each thread is computed with the following formula:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
Limit on the number of thread blocks per grid dimension: 65,535
Limit on the number of threads per thread block: 512
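A minimal sketch of this indexing inside a kernel (the kernel name add and the bound n are illustrative). The increment blockDim.x * gridDim.x is the total number of launched threads, so a launch that respects the limits above can still cover an arbitrarily large array:

__global__ void add(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;   // linearized thread index
    while (tid < n) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;                 // stride = total number of launched threads
    }
}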
Shared memory and synchronization
Shared memory
Adding __shared__ to a variable declaration makes the variable reside in shared memory. The CUDA C compiler treats shared-memory variables differently from ordinary variables: for every thread block launched on the GPU, it creates a copy of the variable. All threads in that block share the memory, but a thread cannot see or modify the copy belonging to another block. This is the basis for communication and cooperation among threads within the same thread block.
Shared memory buffers reside physically on the GPU, so access latency is extremely low.
Synchronization
__syncthreads(); synchronizes the threads within a thread block. Thread divergence can easily occur, and in some scenarios the threads must be synchronized.
Thread divergence: when some threads need to execute an instruction while other threads do not, the threads are said to diverge. Normally, a divergent branch simply leaves some threads idle while the others execute the code in the branch. With __syncthreads() the consequences are worse: the CUDA architecture guarantees that no thread may execute an instruction after the synchronization point until every thread in the block has reached it, so if __syncthreads() sits inside a divergent branch, some threads will never reach it and the block will wait forever.
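A minimal sketch combining shared memory and __syncthreads(): each block computes a partial sum of its slice of the input. The kernel name partial_sum and the block size of 256 are assumptions for illustration; note that every __syncthreads() call sits outside the divergent if, so all threads reach it.

#define THREADS_PER_BLOCK 256

__global__ void partial_sum(const float *in, float *blockSums, int n) {
    __shared__ float cache[THREADS_PER_BLOCK];        // one copy per thread block

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                                  // wait until the whole block has loaded

    // tree reduction inside the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                              // outside the if, so no thread skips it
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = cache[0];
}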
Constant memory and events
Constant memory: NVIDIA hardware provides 64 KB of constant memory, which can effectively reduce memory bandwidth usage. The __constant__ qualifier restricts access to the variable to read-only.
To copy from host memory to constant memory on the GPU, use cudaMemcpyToSymbol().
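A minimal sketch of declaring a constant-memory array and filling it with cudaMemcpyToSymbol(); the array name coeffs, its size, and the kernel are illustrative.

__constant__ float coeffs[256];                     // lives in the 64 KB constant memory

void upload_coeffs(const float *host_coeffs) {
    // copy 256 floats from host memory into the constant-memory symbol
    cudaMemcpyToSymbol(coeffs, host_coeffs, 256 * sizeof(float));
}

__global__ void scale(float *data, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] *= coeffs[tid % 256];             // read-only access from device code
}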
Reasons for the performance improvement:
(1) A single read of constant memory can be broadcast to other nearby threads, saving up to 15 separate reads;
When handling constant memory, NVIDIA hardware broadcasts a single memory read to each half-warp (a group of 16 threads).
(2) Constant memory is cached, so consecutive reads of the same address generate no additional memory traffic.
Thread bundles (warps)
In the CUDA architecture, a warp is a collection of 32 threads that are woven together and executed in lockstep: at every line of the program, each thread in the warp executes the same instruction on different data.
Event API
A CUDA event is essentially a GPU timestamp that is recorded at a point in time specified by the user. Getting a timestamp takes only two steps: create an event and record an event.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// ... perform work on the GPU ...

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // block the CPU until the stop event has been recorded

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
Texture memory (Texture Memory)
Brief introduction:
Like constant memory, texture memory is read-only and cached on chip, so it reduces memory requests and provides more efficient memory bandwidth. It is designed for graphics applications whose memory access patterns exhibit a great deal of spatial locality. In a computing application, this means a thread is likely to read from addresses very close to those read by neighboring threads. Texture memory is designed to accelerate exactly this access pattern.
Data in texture memory is stored in video memory as a one-, two-, or three-dimensional array; accesses can be accelerated through the cache, and the size that can be declared is much larger than constant memory. Reading texture memory inside a kernel is called a texture fetch. Associating data in video memory with a texture reference is called binding the data to the texture. Two kinds of data in video memory can be bound to a texture: ordinary linear memory and CUDA arrays.
Steps for using texture memory:
(1) Declare the input data as a texture reference; the reference is a variable that lives on the GPU;
(2) Allocate GPU memory and bind that buffer to the texture reference with cudaBindTexture(). This tells the CUDA runtime two things:
- we intend to use the specified buffer as a texture;
- we intend to use the texture reference as the texture's "name".
(3) Launch the kernel. When reading the texture inside the kernel, a special function must tell the GPU to route the read request through texture memory instead of standard global memory; use the compiler intrinsic tex1Dfetch();
(4) Release the buffer and clear the binding to the texture with cudaUnbindTexture().
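A minimal sketch of the four steps using the legacy texture reference API from the book (this API is deprecated and removed in recent CUDA toolkits); the names texIn, copy_kernel, and run are illustrative.

texture<float> texIn;                                  // (1) texture reference, declared at file scope

__global__ void copy_kernel(float *out, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        out[tid] = tex1Dfetch(texIn, tid);             // (3) read through texture memory
}

void run(const float *host_in, float *host_out, int n) {
    float *dev_in, *dev_out;
    cudaMalloc((void**)&dev_in,  n * sizeof(float));
    cudaMalloc((void**)&dev_out, n * sizeof(float));
    cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    cudaBindTexture(NULL, texIn, dev_in, n * sizeof(float));   // (2) bind the buffer to the texture

    copy_kernel<<<(n + 255) / 256, 256>>>(dev_out, n);
    cudaMemcpy(host_out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaUnbindTexture(texIn);                          // (4) clear the binding
    cudaFree(dev_in);
    cudaFree(dev_out);
}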
Streams
Page-locked memory
Page-locked host memory (also called pinned or non-pageable memory) is memory that the OS will never page out to disk, which guarantees it always resides in physical memory. The OS therefore lets an application safely use the physical address of this memory, since it will never be destroyed or relocated.
- malloc() allocates standard, pageable host memory;
- cudaHostAlloc() allocates page-locked host memory.
Suggestion: use page-locked memory only for buffers that are the source or destination of a cudaMemcpy() call, and free it as soon as it is no longer needed.
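A minimal sketch of allocating page-locked memory with cudaHostAlloc(), using it as the source of a cudaMemcpy(), and freeing it right afterwards; the buffer size is illustrative.

float *host_buf, *dev_buf;
size_t size = 1024 * 1024 * sizeof(float);                     // illustrative buffer size

cudaHostAlloc((void**)&host_buf, size, cudaHostAllocDefault);  // page-locked host memory
cudaMalloc((void**)&dev_buf, size);

// ... fill host_buf ...
cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);   // copies from pinned memory are faster

cudaFreeHost(host_buf);                                        // release as soon as it is no longer needed
cudaFree(dev_buf);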
CUDA streams
A GPU that supports device overlap can execute a CUDA C kernel while simultaneously performing a copy between the device and the host.
Some newer GPUs support executing a kernel and performing two copy operations at the same time: one from host to device and one from device to host. On any device that supports overlapping memory copies with kernel execution, using multiple streams can improve the overall performance of the application.
To determine whether the device supports overlapping computation with memory copy operations:
#include <stdio.h>

// HANDLE_ERROR is the error-checking macro used throughout the book's examples
int main( void ) {
    cudaDeviceProp prop;
    int whichDevice;
    HANDLE_ERROR( cudaGetDevice(&whichDevice) );
    HANDLE_ERROR( cudaGetDeviceProperties(&prop, whichDevice) );
    if (!prop.deviceOverlap) {
        printf("Device will not handle overlaps\n");
        return 0;
    }
    return 0;
}
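Once the device passes the check above, work can be split across streams. A minimal sketch of overlapping copies and kernel execution with two streams; the kernel scale2 and the function two_stream_demo are illustrative, and the host buffers are page-locked because cudaMemcpyAsync() requires pinned memory to run asynchronously.

__global__ void scale2(const float *in, float *out, int n) {      // illustrative kernel
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) out[tid] = in[tid] * 2.0f;
}

void two_stream_demo(int n) {                                     // n assumed even
    int half = n / 2;
    size_t halfBytes = half * sizeof(float);
    float *host_in, *host_out, *dev_in, *dev_out;

    cudaHostAlloc((void**)&host_in,  n * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaHostAlloc((void**)&host_out, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&dev_in,  n * sizeof(float));
    cudaMalloc((void**)&dev_out, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int i = 0; i < 2; i++) {
        int off = i * half;
        // copy in, launch, copy out -- queued per stream so the two halves can overlap
        cudaMemcpyAsync(dev_in + off, host_in + off, halfBytes, cudaMemcpyHostToDevice, s[i]);
        scale2<<<(half + 255) / 256, 256, 0, s[i]>>>(dev_in + off, dev_out + off, half);
        cudaMemcpyAsync(host_out + off, dev_out + off, halfBytes, cudaMemcpyDeviceToHost, s[i]);
    }

    cudaStreamSynchronize(s[0]);        // wait for both streams to drain
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);  cudaStreamDestroy(s[1]);
    cudaFreeHost(host_in);    cudaFreeHost(host_out);
    cudaFree(dev_in);         cudaFree(dev_out);
}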
CUDA C on multi-GPU systems
Zero-copy memory: this type of host memory can be accessed directly from within a CUDA C kernel. Because it never needs to be copied to the GPU, it is called zero-copy memory. It is allocated with cudaHostAlloc() by passing cudaHostAllocMapped as the last parameter.
To determine whether the device supports mapping host memory:
#include <stdio.h>

// HANDLE_ERROR is the error-checking macro used throughout the book's examples
int main( void ) {
    cudaDeviceProp prop;
    int whichDevice;
    HANDLE_ERROR( cudaGetDevice(&whichDevice) );
    HANDLE_ERROR( cudaGetDeviceProperties(&prop, whichDevice) );
    if (prop.canMapHostMemory != 1) {
        printf("Device can not map memory\n");
        return 0;
    }
    return 0;
}
When the input and output memory are each used only once, using zero-copy memory on a discrete GPU also brings a performance improvement.
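A minimal sketch of allocating zero-copy (mapped) host memory and obtaining a device-side pointer to it; cudaSetDeviceFlags(cudaDeviceMapHost) must be called before any other CUDA work, and the kernel inc is illustrative.

__global__ void inc(float *data, int n) {                   // illustrative kernel
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) data[tid] += 1.0f;
}

void zero_copy_demo(int n) {
    cudaSetDeviceFlags(cudaDeviceMapHost);                  // enable mapped pinned allocations

    float *host_data, *dev_data;
    cudaHostAlloc((void**)&host_data, n * sizeof(float),
                  cudaHostAllocMapped);                     // zero-copy host memory

    cudaHostGetDevicePointer((void**)&dev_data, host_data, 0);  // device alias of the same memory

    inc<<<(n + 255) / 256, 256>>>(dev_data, n);             // kernel accesses host memory directly
    cudaDeviceSynchronize();                                // no cudaMemcpy needed

    cudaFreeHost(host_data);
}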
To determine whether a GPU is integrated or discrete:
cudaGetDeviceProperties() fills in a properties struct; its integrated field is true if the device is an integrated GPU and false otherwise.
Note: in a multi-GPU scenario, every GPU that is to run GPU code needs the host CPU to start a separate thread for it to manage its resources; each GPU gets its own host thread.
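A minimal sketch of the one-host-thread-per-GPU pattern using std::thread (the book uses its own threading helpers instead); each thread selects its GPU with cudaSetDevice() before doing any CUDA work, and the worker body is illustrative.

#include <thread>
#include <vector>

void gpu_worker(int device_id) {
    cudaSetDevice(device_id);                    // all CUDA calls in this thread target this GPU
    float *dev_buf;
    cudaMalloc((void**)&dev_buf, 1024 * sizeof(float));
    // ... launch kernels and copy data for this GPU ...
    cudaFree(dev_buf);
}

void run_on_all_gpus() {
    int count;
    cudaGetDeviceCount(&count);
    std::vector<std::thread> workers;
    for (int i = 0; i < count; i++)
        workers.emplace_back(gpu_worker, i);     // one host thread per GPU
    for (auto &t : workers)
        t.join();
}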
《Programming Massively Parallel Processors: a Hands-On Approach》