Global memory
The GPU's global memory can be read and written by both the CPU and the GPU. Any device can access it across the PCI-E bus, and GPUs can transfer data directly from one GPU card to another without going through the CPU.
This peer-to-peer capability was introduced in the CUDA 4.x SDK. It is supported only on specific platforms (on Windows 7 and Windows Vista it requires Tesla hardware using the TCC driver model; on Linux or Windows XP, both consumer GPUs and Tesla cards support it).
The CPU, as the host-side processor, can access memory on the GPU in three ways (a minimal sketch of all three follows the list):
- explicit blocking transfers;
- explicit non-blocking transfers;
- implicit access through zero-copy memory.
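The sketch below illustrates the three patterns with the CUDA runtime. The buffer size, the stream, and the absence of any kernel launch are illustrative assumptions, not part of the original text.

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const size_t N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    /* needed so that mapped (zero-copy) host memory can be used later */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_data = (float*)malloc(bytes);
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);

    /* 1. Explicit blocking transfer: returns only after the copy has finished. */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* 2. Explicit non-blocking transfer: needs pinned host memory and a stream. */
    float *h_pinned = NULL;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    /* 3. Zero-copy: map pinned host memory into the device address space;
       a kernel then reads it over PCI-E without an explicit copy. */
    float *h_zero = NULL, *d_zero = NULL;
    cudaHostAlloc((void**)&h_zero, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_zero, h_zero, 0);

    cudaFreeHost(h_zero);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    free(h_data);
    return 0;
}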
Once data is on the GPU, the main problem becomes how to access it efficiently there. By structuring the computation so that only one memory access is needed for roughly every ten calculations, memory latency can be largely hidden, but only if accesses to global memory are coalesced.
Whether global memory accesses satisfy the coalescing conditions is one of the most visible factors affecting the performance of a CUDA program.
Coalesced access -- optimizing global memory access
All threads should access a contiguous, aligned block of memory.
If threads access memory in a one-to-one, sequential, aligned pattern, their addresses can be coalesced and served by a single memory transaction. Suppose each thread reads a single-precision float or an integer, i.e. a 4-byte word. Coalescing is performed per warp (the old G80 hardware coalesced per half-warp), so one memory transaction fetches 32 * 4 = 128 bytes of data.
Coalesced transaction sizes of 32, 64, and 128 bytes are supported, corresponding to each thread in the warp reading 8-bit, 16-bit, or 32-bit words respectively, provided that the accesses are contiguous and the starting address is aligned on 32-byte boundaries.
Replacing the standard cudaMalloc with cudaMallocPitch allocates memory in aligned blocks.
extern __host__ cudaError_t CUDARTAPI cudaMallocPitch(void **devPtr, size_t *pitch, size_t width, size_t height);
The first parameter is a pointer to the device memory pointer; the second is a pointer that receives the actual number of bytes per row after alignment (the pitch); the third is the width of the data to be allocated, in bytes; and the last parameter is the height of the array in rows.
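A sketch of allocating and indexing a pitched 2D array follows; the array dimensions and the trivial kernel are illustrative assumptions.

#include <cuda_runtime.h>

__global__ void scaleRows(float *devPtr, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        /* each row starts at an aligned offset of `pitch` bytes */
        float *row = (float*)((char*)devPtr + y * pitch);
        row[x] *= 2.0f;
    }
}

void example(void)
{
    int width = 1000, height = 1000;   /* elements per row, number of rows */
    float *devPtr = NULL;
    size_t pitch = 0;
    /* width is passed in bytes; pitch returns the aligned row size in bytes */
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scaleRows<<<grid, block>>>(devPtr, pitch, width, height);

    cudaFree(devPtr);
}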
Coalescing requires that the threads of the same warp (or the same half-warp on older hardware) access an aligned segment using a uniform word size.
The specific coalescing requirements differ between devices:
- On devices of compute capability 1.0 and 1.1, the k-th thread of a half-warp must access the k-th word of the segment, and the segment address accessed by the half-warp must be aligned to 16 times the word size used by each thread. Only coalesced access to 32-bit, 64-bit, and 128-bit words is supported.
- On devices of compute capability 1.2 and higher, the coalescing requirements are greatly relaxed: word sizes of 8 bit (segment size 32 bytes), 16 bit (segment size 64 bytes), and 32/64/128 bit (segment size 128 bytes) are supported.
The following steps describe how a half-warp on compute capability 1.2/1.3 hardware completes a coalesced access (a kernel sketch contrasting coalesced and strided patterns follows the list).
- First, find the segment containing the address requested by the active thread with the lowest thread ID (thread 0 of the first half-warp, or thread 16 of the second half-warp). For 8-bit data the segment size is 32 bytes, for 16-bit data it is 64 bytes, and for 32-, 64-, and 128-bit data it is 128 bytes.
- Then, find all other active threads whose requested addresses fall within the same segment. If all threads access only the first half or the second half of the segment, the transaction size can be reduced: for a 128-byte segment in which only the upper or lower half is used, the actual transfer shrinks to 64 bytes, and likewise a 64-byte coalesced transfer shrinks to 32 bytes if only one half is used.
- The transaction is issued. The threads whose accesses have been serviced become inactive, and their execution resources are released for other ready warps on the SM.
- This process is repeated until every thread in the half-warp has been serviced.
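The two kernels below contrast a coalesced read with a strided read; the arrays and the stride value are illustrative assumptions.

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          /* consecutive threads touch consecutive 4-byte words:
                                    one 128-byte transaction per warp/half-warp */
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride]; /* addresses are spread across many segments, so each
                                    half-warp needs several separate transactions */
}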
Note that memory allocated through the runtime API (for example with cudaMalloc()) is guaranteed to have its start address aligned to at least 256 bytes. Therefore, choosing an appropriate thread block size (for example a multiple of 16) lets half-warp access requests line up with segment boundaries. Defining structures with the __align__(8) or __align__(16) qualifier makes accesses to arrays of those structures segment-aligned, as in the sketch below.
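A small sketch of the alignment qualifier; the structure layout is an illustrative assumption.

/* __align__(16) pads the 12 bytes of payload so that each array element
   starts on a 16-byte boundary (sizeof(Particle) becomes 16). */
struct __align__(16) Particle {
    float x, y, z;
};

__global__ void movePart(Particle *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x += 1.0f;   /* element i begins at a 16-byte-aligned address */
}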
If accesses are misaligned or strided, the effective bandwidth drops sharply. For strided access patterns in global memory, shared memory can be used to restore coalescing, as in the transpose sketch below.
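The classic example is a matrix transpose: reading and writing through a shared-memory tile keeps both global memory accesses coalesced. The tile size and array shapes are illustrative assumptions.

#define TILE 16

__global__ void transpose(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];  /* +1 column avoids shared-memory bank conflicts */

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    /* coalesced read  */

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;    /* swap block indices for the write */
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  /* coalesced write */
}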
Global memory allocation
When the CUDA runtime is used, both device pointers and host pointers have the type void*.
Dynamic memory allocation
Most global memory in CUDA is allocated dynamically. With the CUDA runtime, the following functions allocate and free global memory.
cudaError_t cudaMalloc(void **, size_t);
cudaError_t cudaFree(void *);
The corresponding driver API functions are:
CUresult CUDAAPI cuMemAlloc(CUdeviceptr *dptr, size_t bytesize);
CUresult CUDAAPI cuMemFree(CUdeviceptr dptr);
Allocating global memory is expensive. The CUDA driver implements a suballocator for small memory requests, but when the suballocator has to create a new memory block it must make a very expensive kernel-mode call into the operating system. When that happens, the CUDA driver must also synchronize with the GPU, which can break CPU/GPU concurrency. It is therefore good practice to avoid allocating or freeing global memory in performance-critical code; a common pattern is sketched below.
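The sketch amortizes allocation cost by allocating a working buffer once and reusing it across iterations; the function name, batch layout, and iteration count are illustrative assumptions.

#include <cuda_runtime.h>

void process_many_batches(const float *h_batches, int numBatches, size_t batchElems)
{
    float *d_buf = NULL;
    size_t bytes = batchElems * sizeof(float);
    cudaMalloc((void**)&d_buf, bytes);             /* one allocation up front */

    for (int i = 0; i < numBatches; ++i) {
        cudaMemcpy(d_buf, h_batches + i * batchElems, bytes,
                   cudaMemcpyHostToDevice);        /* reuse the same buffer   */
        /* ... launch kernels operating on d_buf ... */
    }

    cudaFree(d_buf);                               /* one release at the end  */
}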
Static memory allocation
Memory can be allocated statically by marking its declaration with the __device__ keyword. Such memory is allocated by the CUDA driver when the module is loaded.
Runtime API:
cudaError_t cudaMemcpyToSymbol(
char *symbol,
const void *src,
size_t count,
size_t offset=0,
enum cudaMemcpyKind kind=cudaMemcpyHostToDevice
);
cudaError_t cudaMemcpyFromSymbol(
void *dst,
char *symbol,
size_t count,
size_t offset,
enum cudaMemcpyKind kind=cudaMemcpyDeviceToHost
);
A CUDA runtime application can call cudaGetSymbolAddress() to query the pointer associated with statically allocated memory.
cudaError_t cudaGetSymbolAddress(void **devPtr, char *symbol);
The driver API:
CUresult CUDAAPI cuModuleGetGlobal(CUdeviceptr *dptr, size_t *bytes, CUmodule hmod, const char *name);
This function returns the base pointer and the size of the object. If the size is not needed, NULL can be passed for the bytes parameter.
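A runtime API sketch of statically allocated device memory; the array g_lut and its size are illustrative. Note that recent toolkits accept the symbol itself rather than a string name, which is the form used here.

__device__ float g_lut[256];            /* allocated when the module is loaded */

void fill_lut(const float *h_lut)
{
    /* copy 256 floats from the host into the static device array */
    cudaMemcpyToSymbol(g_lut, h_lut, 256 * sizeof(float));

    /* obtain an ordinary device pointer to the same memory,
       e.g. to pass to cudaMemset or as a kernel argument */
    float *d_lut = NULL;
    cudaGetSymbolAddress((void**)&d_lut, g_lut);
    cudaMemset(d_lut, 0, 256 * sizeof(float));
}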
Pointer query
The CUDA runtime tracks all memory allocations and provides an API (cudaPointerGetAttributes()) that lets an application query the attributes of any pointer it holds. Libraries and plug-ins can use this information to choose different handling strategies, as in the sketch after the structure definition.
struct cudaPointerAttributes{
enum cudaMemoryType memoryType;
int device;
void *devicePointer;
void *hostPointer;
};
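A sketch of querying a pointer's attributes. The enum field describing the memory type has changed names across CUDA versions, so the example only prints the fields that are stable; the helper function is an illustrative assumption.

#include <cuda_runtime.h>
#include <stdio.h>

void describe(const void *ptr)
{
    struct cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        printf("pointer is not known to CUDA\n");
        return;
    }
    printf("device %d, devicePointer %p, hostPointer %p\n",
           attr.device, attr.devicePointer, attr.hostPointer);
}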