
Common CUDA concepts and points of attention

2020-11-09 22:37:00 Li Baqian

Thread index calculation

Each parallel thread only needs to know its initial index and how to determine the increment. Because we want every parallel thread to start from a different index, the thread index and thread-block index are linearized into a single global index, computed for each thread with the following formula:

int tid = threadIdx.x + blockIdx.x * blockDim.x;

Limit on the number of thread blocks (per grid dimension): 65,535

Limit on the number of threads in a thread block: 512
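
These limits are why kernels written for large inputs usually combine the index formula with a stride. The sketch below is a minimal illustration, not from the original post; the kernel name add and the launch configuration are invented for the example:

__global__ void add(const int *a, const int *b, int *c, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;   // linearized global index
    while (tid < n) {                                  // n may exceed the number of launched threads
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;                 // stride by the total thread count
    }
}

// launch example: add<<<128, 128>>>(dev_a, dev_b, dev_c, n);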

Shared memory and synchronization

Shared memory

Adding __shared__ to a variable declaration makes the variable reside in shared memory. The CUDA C compiler treats shared-memory variables differently from ordinary variables: for each thread block launched on the GPU, it creates a separate copy of the variable. All threads in that block share the copy, but they cannot see or modify the copies belonging to other thread blocks. This is the basis for communication and cooperation among the threads within a single thread block.

Shared-memory buffers reside physically on the GPU, so their access latency is extremely low.

Synchronization

__syncthreads() synchronizes the threads within a thread block. Thread divergence can easily occur, so in some scenarios the threads must be explicitly synchronized.

Thread divergence: when some threads need to execute an instruction while other threads do not, the threads are said to diverge. Under normal circumstances, divergent branches simply leave some threads idle while the remaining threads execute the code in the branch. With __syncthreads() the consequences are more serious: the CUDA architecture guarantees that no thread may execute an instruction after the synchronization point until every thread in the block has reached it, so a __syncthreads() placed inside a divergent branch can stall the kernel indefinitely.
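
To show how shared memory and __syncthreads() work together, here is a minimal sketch (not from the original post; the kernel name dotPartial and the block size are illustrative) of a per-block partial dot product:

#define THREADS_PER_BLOCK 256   // launch with this many threads per block

__global__ void dotPartial(const float *a, const float *b, float *partial, int n) {
    __shared__ float cache[THREADS_PER_BLOCK];   // one copy per thread block

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float temp = 0.0f;
    while (tid < n) {                            // grid-stride loop over the input
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[threadIdx.x] = temp;

    __syncthreads();                             // every thread in the block must reach this barrier

    // Tree reduction in shared memory; __syncthreads() stays outside the if-branch
    // so no thread waits on a barrier that other threads never reach.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];
}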

Constant memory and events

Constant memory: NVIDIA provides 64 KB of constant memory, which can effectively reduce memory bandwidth usage. __constant__ restricts a variable to read-only access.

To copy data from host memory into constant memory on the GPU, use cudaMemcpyToSymbol().
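
A minimal sketch of declaring a constant-memory array and filling it from the host (the names devCoeffs and uploadCoeffs are invented for illustration):

#define N 256

__constant__ float devCoeffs[N];   // lives in the GPU's 64 KB constant memory

void uploadCoeffs(const float *hostCoeffs) {
    // Copy to the __constant__ symbol directly; no cudaMalloc is involved.
    cudaMemcpyToSymbol(devCoeffs, hostCoeffs, N * sizeof(float));
}

// Kernels then simply read devCoeffs[i]; the reads are cached and broadcast.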

Reasons for the performance improvement

(1) A single read from constant memory can be broadcast to nearby threads, saving up to 15 separate read operations;

When handling constant memory, NVIDIA hardware broadcasts a single memory read to each half-warp (a group of 16 threads), which is where the saving of 15 reads comes from.

(2) Constant memory is cached, so consecutive reads of the same address do not generate additional memory traffic.

Warps

In the CUDA architecture, a warp is a group of 32 threads that are woven together and executed in lockstep: at every line of the program, each thread in the warp executes the same instruction, but on different data.

Event API

A CUDA event is essentially a GPU timestamp that is recorded at a point in time specified by the user. Obtaining a timestamp takes only two steps: create an event, then record the event.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);          // record the start event into stream 0

// GPU work
...

cudaEventRecord(stop, 0);           // record the stop event into stream 0
cudaEventSynchronize(stop);         // block until the stop event has actually been recorded

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);

Texture memory (Texture Memory)

Brief introduction:

Like constant memory, texture memory is read-only and cached on chip, so it reduces memory requests and provides more effective memory bandwidth. It is designed specifically for graphics applications whose memory access patterns exhibit a great deal of spatial locality. In a compute application, this means that a thread is likely to read from addresses very close to the addresses read by neighboring threads; texture memory is designed to accelerate exactly this access pattern.

Data in texture memory is stored in device memory as one-, two-, or three-dimensional arrays; accesses are accelerated by the cache, and a texture can be declared much larger than constant memory. Reading texture memory inside a kernel is called a texture fetch. Associating data in device memory with a texture reference is called binding the data to the texture. Two kinds of device-memory data can be bound to a texture: ordinary linear memory and CUDA arrays.

Usage steps:

(1) Declare the input data as a texture-type reference; the declaration lives at file scope and refers to data on the GPU;

(2) Allocate GPU memory and bind the texture reference to the memory buffer with cudaBindTexture(). This tells the CUDA runtime two things:

  • we intend to use the specified buffer as a texture;
  • we intend to use the texture reference as the texture's "name".

(3) Launch the kernel. When reading the texture inside the kernel, a special function must tell the GPU to route the read request to texture memory rather than to standard global memory; use the compiler intrinsic tex1Dfetch();

(4) Free the buffer and clear the texture binding with cudaUnbindTexture(). The sketch below walks through all four steps.
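
The steps above describe the legacy texture-reference API (deprecated in newer CUDA releases in favor of texture objects). A rough sketch under that assumption; the names texIn, dev_in and the copy kernel are invented for illustration:

texture<float> texIn;                 // (1) file-scope texture reference for 1D float data

__global__ void copyKernel(float *out, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        out[tid] = tex1Dfetch(texIn, tid);   // (3) fetch through the texture unit
}

void example(int n) {
    float *dev_in, *dev_out;
    cudaMalloc((void**)&dev_in, n * sizeof(float));
    cudaMalloc((void**)&dev_out, n * sizeof(float));

    cudaBindTexture(NULL, texIn, dev_in, n * sizeof(float));   // (2) bind linear memory to the texture

    copyKernel<<<(n + 255) / 256, 256>>>(dev_out, n);

    cudaUnbindTexture(texIn);                                  // (4) unbind and free
    cudaFree(dev_in);
    cudaFree(dev_out);
}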

Streams

Page-locked memory

Page-locked host memory, also called pinned memory or non-pageable memory: the OS will never page this memory out and swap it to disk, which guarantees that it always resides in physical memory. The OS therefore allows an application to safely access the memory's physical address, because the memory will not be destroyed or relocated.

  • malloc allocates standard, pageable host memory;
  • cudaHostAlloc allocates page-locked host memory.

Recommendation: use page-locked memory only for buffers that serve as the source or destination of cudaMemcpy() calls, and free it as soon as it is no longer needed.
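
A minimal sketch of allocating and releasing pinned memory (the buffer names and size are illustrative):

int n = 1 << 20;
float *host_buf, *dev_buf;

cudaHostAlloc((void**)&host_buf, n * sizeof(float), cudaHostAllocDefault);   // pinned instead of malloc
cudaMalloc((void**)&dev_buf, n * sizeof(float));

// copies to and from pinned memory can use the faster DMA path
cudaMemcpy(dev_buf, host_buf, n * sizeof(float), cudaMemcpyHostToDevice);

cudaFreeHost(host_buf);   // release as soon as it is no longer needed
cudaFree(dev_buf);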

Streams

A GPU that supports device overlap can execute a CUDA C kernel while simultaneously performing a copy between the device and the host.

Some newer GPUs can execute a kernel and perform two copy operations at the same time, one from host to device and one from device to host. On any device that can overlap memory copies with kernel execution, using multiple streams can improve the overall performance of the application.

Determining whether the device supports overlapping computation with memory copy operations:

#include <cstdio>

// HANDLE_ERROR is assumed to be a user-defined macro that checks the returned cudaError_t.
int main( void ) {
	cudaDeviceProp prop;
	int whichDevice;
	HANDLE_ERROR( cudaGetDevice(&whichDevice) );
	HANDLE_ERROR( cudaGetDeviceProperties(&prop, whichDevice) );
	if (!prop.deviceOverlap) {
		printf("Device will not handle overlaps\n");
		return 0;
	}
	return 0;
}
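
Once overlap support is confirmed, work is issued into a stream with asynchronous copies so that the copy engine and the kernel engine can run concurrently. A rough sketch (the kernel name, buffer names, and sizes are illustrative, and the host buffers must be page-locked for cudaMemcpyAsync to be truly asynchronous):

cudaStream_t stream;
cudaStreamCreate(&stream);

// host_a and host_c are assumed to come from cudaHostAlloc (page-locked)
cudaMemcpyAsync(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice, stream);
kernel<<<N / 256, 256, 0, stream>>>(dev_a, dev_c);   // launched into the same stream
cudaMemcpyAsync(host_c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // wait for everything queued in the stream to finish
cudaStreamDestroy(stream);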

CUDA C on multi-GPU systems

Zero-copy memory: a CUDA C kernel can access this type of host memory directly; because the memory never needs to be copied to the GPU, it is called zero-copy memory. It is allocated with cudaHostAlloc, passing cudaHostAllocMapped as the last parameter.

Determining whether the device supports mapping host memory:

#include <cstdio>

// HANDLE_ERROR is assumed to be a user-defined macro that checks the returned cudaError_t.
int main( void ) {
	cudaDeviceProp prop;
	int whichDevice;
	HANDLE_ERROR( cudaGetDevice(&whichDevice) );
	HANDLE_ERROR( cudaGetDeviceProperties(&prop, whichDevice) );
	if (prop.canMapHostMemory != 1) {
		printf("Device can not map memory\n");
		return 0;
	}
	return 0;
}

When the input and output buffers are each used exactly once, using zero-copy memory even on a discrete GPU can bring a performance improvement.
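
A rough sketch of allocating mapped (zero-copy) memory and obtaining its device-side pointer (the buffer names, someKernel, and the launch configuration are invented; cudaSetDeviceFlags(cudaDeviceMapHost) must be called before any other CUDA work):

cudaSetDeviceFlags(cudaDeviceMapHost);   // enable host-memory mapping up front

float *host_a, *dev_a;

// page-locked host memory that is also mapped into the device address space
cudaHostAlloc((void**)&host_a, N * sizeof(float),
              cudaHostAllocMapped | cudaHostAllocWriteCombined);

// obtain the device pointer that refers to the same physical memory
cudaHostGetDevicePointer((void**)&dev_a, host_a, 0);

// kernels can now read and write host_a through dev_a without an explicit cudaMemcpy
someKernel<<<blocks, threads>>>(dev_a);
cudaDeviceSynchronize();

cudaFreeHost(host_a);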

Determining whether a given GPU is integrated or discrete:

cudaGetDeviceProperties() fills in a properties struct; its integrated field is true if the device is an integrated GPU and false otherwise.
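
A minimal sketch of that check:

cudaDeviceProp prop;
int dev;
cudaGetDevice(&dev);
cudaGetDeviceProperties(&prop, dev);

if (prop.integrated)
    printf("Integrated GPU: it shares physical memory with the CPU\n");
else
    printf("Discrete GPU\n");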

Note: in a multi-GPU scenario, every GPU that is to run GPU code needs the host CPU to start a separate thread for it to manage its resources; each GPU gets its own dedicated thread.

Reference: 《Programming Massively Parallel Processors: A Hands-On Approach》
