Global memory
The GPU's global memory can be read and written by both the CPU and the GPU. Any device can access it across the PCI-E bus, and GPUs can transfer data directly from one GPU card to another without going through the CPU.
This peer-to-peer capability was introduced in the CUDA 4.x SDK. It is supported only on specific platforms (on Windows 7 and Windows Vista it requires Tesla hardware using the TCC driver model; on Linux or Windows XP, both consumer GPUs and Tesla cards support it).
The CPU, as the host-side processor, can access memory on the GPU in three ways (a minimal sketch of all three follows the list):
- explicit blocking transfers;
- explicit non-blocking transfers;
- implicit access through zero-copy memory.
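The sketch below illustrates the three patterns with the CUDA runtime. The buffer size, the stream, and the absence of any kernel launch are illustrative assumptions, not part of the original text.

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    /* needed so that mapped (zero-copy) host memory can be used later */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_data = (float*)malloc(bytes);
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);

    /* 1. Explicit blocking transfer: returns only after the copy has finished. */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* 2. Explicit non-blocking transfer: needs pinned host memory and a stream. */
    float *h_pinned = NULL;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    /* 3. Zero-copy: map pinned host memory into the device address space;
       a kernel then reads it over PCI-E without an explicit copy. */
    float *h_zero = NULL, *d_zero = NULL;
    cudaHostAlloc((void**)&h_zero, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_zero, h_zero, 0);

    cudaFreeHost(h_zero);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    free(h_data);
    return 0;
}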
Once data is on the GPU, the main problem becomes how to access it efficiently there. By structuring the computation so that only one memory access is needed for roughly every ten calculations, memory latency can be largely hidden, but only if accesses to global memory are coalesced.
Whether global memory accesses satisfy the coalescing conditions is one of the most visible factors affecting the performance of a CUDA program.
Coalesced access -- optimizing global memory access
All threads should access a contiguous, aligned block of memory.
If threads access memory in a one-to-one, sequential, aligned pattern, their addresses can be coalesced and served by a single memory transaction. Suppose each thread reads a single-precision float or an integer, i.e. a 4-byte word. Coalescing is performed per warp (the old G80 hardware coalesced per half-warp), so one memory transaction fetches 32 * 4 = 128 bytes of data.
Coalesced transaction sizes of 32, 64, and 128 bytes are supported, corresponding to each thread in the warp reading 8-bit, 16-bit, or 32-bit words respectively, provided that the accesses are contiguous and the starting address is aligned on 32-byte boundaries.
Replacing the standard cudaMalloc with cudaMallocPitch allocates memory in aligned blocks.
extern __host__ cudaError_t CUDARTAPI cudaMallocPitch(void **devPtr, size_t *pitch, size_t width, size_t height);
The first parameter is a pointer to the device memory pointer; the second is a pointer that receives the actual number of bytes per row after alignment (the pitch); the third is the width of the data to be allocated, in bytes; and the last parameter is the height of the array in rows.
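A sketch of allocating and indexing a pitched 2D array follows; the array dimensions and the trivial kernel are illustrative assumptions.

#include <cuda_runtime.h>

__global__ void scaleRows(float *devPtr, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        /* each row starts at an aligned offset of `pitch` bytes */
        float *row = (float*)((char*)devPtr + y * pitch);
        row[x] *= 2.0f;
    }
}

void example(void)
{
    int width = 1000, height = 1000;   /* elements per row, number of rows */
    float *devPtr = NULL;
    size_t pitch = 0;
    /* width is passed in bytes; pitch returns the aligned row size in bytes */
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scaleRows<<<grid, block>>>(devPtr, pitch, width, height);

    cudaFree(devPtr);
}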
Coalescing requires that the threads of the same warp (or the same half-warp on older hardware) access an aligned segment using a uniform word size.
The specific coalescing requirements differ between devices:
- On devices of compute capability 1.0 and 1.1, the k-th thread of a half-warp must access the k-th word of the segment, and the segment address accessed by the half-warp must be aligned to 16 times the word size used by each thread. Only coalesced access to 32-bit, 64-bit, and 128-bit words is supported.
- On devices of compute capability 1.2 and higher, the coalescing requirements are greatly relaxed: word sizes of 8 bit (segment size 32 bytes), 16 bit (segment size 64 bytes), and 32/64/128 bit (segment size 128 bytes) are supported.
The following steps describe how a half-warp on compute capability 1.2/1.3 hardware completes a coalesced access (a kernel sketch contrasting coalesced and strided patterns follows the list).
- First, find the segment containing the address requested by the active thread with the lowest thread ID (thread 0 of the first half-warp, or thread 16 of the second half-warp). For 8-bit data the segment size is 32 bytes, for 16-bit data it is 64 bytes, and for 32-, 64-, and 128-bit data it is 128 bytes.
- Then, find all other active threads whose requested addresses fall within the same segment. If all threads access only the first half or the second half of the segment, the transaction size can be reduced: for a 128-byte segment in which only the upper or lower half is used, the actual transfer shrinks to 64 bytes, and likewise a 64-byte coalesced transfer shrinks to 32 bytes if only one half is used.
- The transaction is issued. The threads whose accesses have been serviced become inactive, and their execution resources are released for other ready warps on the SM.
- This process is repeated until every thread in the half-warp has been serviced.
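The two kernels below contrast a coalesced read with a strided read; the arrays and the stride value are illustrative assumptions.

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          /* consecutive threads touch consecutive 4-byte words:
                                    one 128-byte transaction per warp/half-warp */
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride]; /* addresses are spread across many segments, so each
                                    half-warp needs several separate transactions */
}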
Note that memory allocated through the runtime API (for example with cudaMalloc()) is guaranteed to have its start address aligned to at least 256 bytes. Therefore, choosing an appropriate thread block size (for example a multiple of 16) lets half-warp access requests line up with segment boundaries. Defining structures with the __align__(8) or __align__(16) qualifier makes accesses to arrays of those structures segment-aligned, as in the sketch below.
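A small sketch of the alignment qualifier; the structure layout is an illustrative assumption.

/* __align__(16) pads the 12 bytes of payload so that each array element
   starts on a 16-byte boundary (sizeof(Particle) becomes 16). */
struct __align__(16) Particle {
    float x, y, z;
};

__global__ void movePart(Particle *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x += 1.0f;   /* element i begins at a 16-byte-aligned address */
}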
If accesses are misaligned or strided, the effective bandwidth drops sharply. For strided access patterns in global memory, shared memory can be used to restore coalescing, as in the transpose sketch below.
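The classic example is a matrix transpose: reading and writing through a shared-memory tile keeps both global memory accesses coalesced. The tile size and array shapes are illustrative assumptions.

#define TILE 16

__global__ void transpose(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];  /* +1 column avoids shared-memory bank conflicts */

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    /* coalesced read  */

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;    /* swap block indices for the write */
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  /* coalesced write */
}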
Global memory allocation
When the CUDA runtime is used, both device pointers and host pointers have the type void*.
Dynamic memory allocation
Most global memory in CUDA is allocated dynamically. With the CUDA runtime, the following functions allocate and free global memory.
cudaError_t cudaMalloc(void **, size_t);
cudaError_t cudaFree(void *);
The corresponding driver API functions are:
CUresult CUDAAPI cuMemAlloc(CUdeviceptr *dptr, size_t bytesize);
CUresult CUDAAPI cuMemFree(CUdeviceptr dptr);
Allocating global memory is expensive. The CUDA driver implements a suballocator for small memory requests, but when the suballocator has to create a new memory block it must make a very expensive kernel-mode call into the operating system. When that happens, the CUDA driver must also synchronize with the GPU, which can break CPU/GPU concurrency. It is therefore good practice to avoid allocating or freeing global memory in performance-critical code; a common pattern is sketched below.
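The sketch amortizes allocation cost by allocating a working buffer once and reusing it across iterations; the function name, batch layout, and iteration count are illustrative assumptions.

#include <cuda_runtime.h>

void process_many_batches(const float *h_batches, int numBatches, size_t batchElems)
{
    float *d_buf = NULL;
    size_t bytes = batchElems * sizeof(float);
    cudaMalloc((void**)&d_buf, bytes);             /* one allocation up front */

    for (int i = 0; i < numBatches; ++i) {
        cudaMemcpy(d_buf, h_batches + i * batchElems, bytes,
                   cudaMemcpyHostToDevice);        /* reuse the same buffer   */
        /* ... launch kernels operating on d_buf ... */
    }

    cudaFree(d_buf);                               /* one release at the end  */
}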
Static memory allocation
Memory can be allocated statically by marking its declaration with the __device__ keyword. Such memory is allocated by the CUDA driver when the module is loaded.
Runtime API:
cudaError_t cudaMemcpyToSymbol(
char *symbol,
const void *src,
size_t count,
size_t offset=0,
enum cudaMemcpyKind kind=cudaMemcpyHostToDevice
);
cudaError_t cudaMemcpyFromSymbol(
void *dst,
char *symbol,
size_t count,
size_t offset,
enum cudaMemcpyKind kind=cudaMemcpyDeviceToHost
);
A CUDA runtime application can call cudaGetSymbolAddress() to query the pointer associated with statically allocated memory.
cudaError_t cudaGetSymbolAddress(void **devPtr, char *symbol);
The driver API:
CUresult CUDAAPI cuModuleGetGlobal(CUdeviceptr *dptr, size_t *bytes, CUmodule hmod, const char *name);
This function returns the base pointer and the size of the object. If the size is not needed, NULL can be passed for the bytes parameter.
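A runtime API sketch of statically allocated device memory; the array g_lut and its size are illustrative. Note that recent toolkits accept the symbol itself rather than a string name, which is the form used here.

__device__ float g_lut[256];            /* allocated when the module is loaded */

void fill_lut(const float *h_lut)
{
    /* copy 256 floats from the host into the static device array */
    cudaMemcpyToSymbol(g_lut, h_lut, 256 * sizeof(float));

    /* obtain an ordinary device pointer to the same memory,
       e.g. to pass to cudaMemset or as a kernel argument */
    float *d_lut = NULL;
    cudaGetSymbolAddress((void**)&d_lut, g_lut);
    cudaMemset(d_lut, 0, 256 * sizeof(float));
}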
Pointer query
The CUDA runtime tracks all memory allocations and provides an API (cudaPointerGetAttributes()) that lets an application query the attributes of any pointer it holds. Libraries and plug-ins can use this information to choose different handling strategies, as in the sketch after the structure definition.
struct cudaPointerAttributes{
enum cudaMemoryType memoryType;
int device;
void *devicePointer;
void *hostPointer;
};
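A sketch of querying a pointer's attributes. The enum field describing the memory type has changed names across CUDA versions, so the example only prints the fields that are stable; the helper function is an illustrative assumption.

#include <cuda_runtime.h>
#include <stdio.h>

void describe(const void *ptr)
{
    struct cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        printf("pointer is not known to CUDA\n");
        return;
    }
    printf("device %d, devicePointer %p, hostPointer %p\n",
           attr.device, attr.devicePointer, attr.hostPointer);
}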