Memory Space Specifiers in CUDA

2022-07-02 06:28:00 【Little Heshang sweeping the floor】

Variable Memory Space Specifiers

A variable memory space specifier denotes the memory location of a variable on the device.

An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ memory space specifiers described in this section generally resides in a register. In some cases, however, the compiler may choose to place it in local memory, which can have adverse performance consequences, as described in Device Memory Accesses.
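As a rough illustration of the difference, the sketch below declares one automatic scalar and one automatic array inside a kernel. The kernel name, the host helper run_automatic_demo() and the array size are hypothetical choices made for this example. Whether the array actually lands in local memory is up to the compiler; compiling with nvcc -Xptxas -v is one way to see the reported register and local memory usage.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: both variables below are automatic (no specifier).
__global__ void automatic_vars(const int* pick, int* out)
{
    int scale = threadIdx.x + 1;   // automatic scalar: usually kept in a register
    int scratch[8];                // automatic array: the compiler may place it in local memory
    for (int i = 0; i < 8; ++i)
        scratch[i] = i * scale;

    // The index is only known at run time, so it cannot be resolved at compile time.
    out[threadIdx.x] = scratch[pick[0] & 7];
}

// Hypothetical host helper: returns the value written by thread 3, or -1 on error.
int run_automatic_demo()
{
    int h_pick = 5;
    int h_out[4] = {0, 0, 0, 0};
    int* d_pick = nullptr;
    int* d_out = nullptr;

    if (cudaMalloc(&d_pick, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_pick); return -1; }

    cudaMemcpy(d_pick, &h_pick, sizeof(int), cudaMemcpyHostToDevice);
    automatic_vars<<<1, 4>>>(d_pick, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_pick);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out[3] : -1;   // thread 3: scratch[5] = 5 * 4
}
```

Calling run_automatic_demo() from main() on a machine with a CUDA-capable GPU should return 20.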

1. __device__

The __device__ memory space specifier declares a variable that resides on the device.

At most one of the other memory space specifiers defined in the next three sections may be used together with __device__ to further denote which memory space the variable belongs to. If none of them is present, the variable:

Resides in global memory space,
Has the lifetime of the CUDA context in which it is created,
Has a distinct object per device,
Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
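The properties above can be sketched with a small counter in global memory. The variable d_counter, the kernel bump_counter and the host helper run_device_counter_demo() are hypothetical names for this example; only the cudaMemcpyToSymbol() and cudaMemcpyFromSymbol() calls come from the runtime library listed above.

```cuda
#include <cuda_runtime.h>

// Hypothetical counter in global memory; each device gets its own instance.
__device__ int d_counter;

// Every thread in the grid can access the variable; here each adds 1 atomically.
__global__ void bump_counter()
{
    atomicAdd(&d_counter, 1);
}

// Hypothetical host helper: returns the counter after the launch, or -1 on error.
int run_device_counter_demo()
{
    int zero = 0;
    // The host reaches the variable through the runtime library, not by name.
    if (cudaMemcpyToSymbol(d_counter, &zero, sizeof(int)) != cudaSuccess) return -1;

    bump_counter<<<4, 64>>>();   // 4 blocks x 64 threads = 256 increments
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    int result = 0;
    if (cudaMemcpyFromSymbol(&result, d_counter, sizeof(int)) != cudaSuccess) return -1;
    return result;
}
```

With a working GPU, run_device_counter_demo() should return 256, one increment per launched thread.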

2. __constant__

The __constant__ memory space specifier, optionally used together with __device__, declares a variable that:

Resides in constant memory space,
Has the lifetime of the CUDA context in which it is created,
Has a distinct object per device,
Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
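A typical use is a small table of read-only parameters that the host fills in before a launch. The following is a minimal sketch under that assumption; c_coeffs, eval_poly and run_constant_demo() are hypothetical names, and the polynomial is only an illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical polynomial coefficients in constant memory (read-only in device code).
__constant__ float c_coeffs[3];

// Evaluates c0 + c1*x + c2*x^2 for each element using Horner's scheme.
__global__ void eval_poly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = c_coeffs[0] + v * (c_coeffs[1] + v * c_coeffs[2]);
    }
}

// Hypothetical host helper: evaluates the polynomial at x = 2, returns -1 on error.
float run_constant_demo()
{
    const float h_coeffs[3] = {1.0f, 2.0f, 3.0f};
    // The host writes constant memory through the runtime library.
    if (cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs)) != cudaSuccess) return -1.0f;

    float h_x = 2.0f, h_y = 0.0f;
    float* d_x = nullptr;
    float* d_y = nullptr;
    if (cudaMalloc(&d_x, sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_y, sizeof(float)) != cudaSuccess) { cudaFree(d_x); return -1.0f; }

    cudaMemcpy(d_x, &h_x, sizeof(float), cudaMemcpyHostToDevice);
    eval_poly<<<1, 32>>>(d_x, d_y, 1);
    cudaError_t err = cudaMemcpy(&h_y, d_y, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    return (err == cudaSuccess) ? h_y : -1.0f;   // 1 + 2*2 + 3*4
}
```

For the coefficients chosen here, run_constant_demo() should return 17.0f.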

3. __shared__

The __shared__ memory space specifier, optionally used together with __device__, declares a variable that:

Resides in the shared memory space of a thread block,
Has the lifetime of the block,
Has a distinct object per block,
Is only accessible from the threads within the block,
Does not have a constant address.

When a variable in shared memory is declared as an external array, for example:

extern __shared__ float shared[];

the size of the array is determined at launch time (see Execution Configuration). All variables declared in this fashion start at the same address in memory, so the layout of the variables in the array must be managed explicitly through offsets. For example, to obtain the equivalent of

short array0[128];
float array1[64];
int   array2[256];

in dynamically allocated shared memory, the arrays can be declared and initialized as follows:

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;         // 128 shorts at byte offset 0
    float* array1 = (float*)&array0[128];  // 64 floats at byte offset 256
    int*   array2 =   (int*)&array1[64];   // 256 ints at byte offset 512
}

Note that pointers need to be aligned to the type they point to, so the following code does not work, because array1 is not aligned to 4 bytes.

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[127];  // byte offset 254: not a multiple of 4
}

Table 4 lists the alignment requirements for the built-in vector types.
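To tie the pieces of this section together, here is a minimal sketch of a kernel that partitions one external shared array into the three aligned sub-arrays discussed above, together with a launch that passes the total byte count as the third execution configuration argument. The kernel partition_demo, the helper run_shared_layout_demo() and the values written are hypothetical and chosen only for this example.

```cuda
#include <cuda_runtime.h>

// One external shared array; its size in bytes is supplied at launch time.
extern __shared__ float array[];

// Hypothetical kernel, assumed to be launched with 64 threads in one block.
__global__ void partition_demo(int* out)
{
    short* array0 = (short*)array;         // 128 shorts, byte offset 0
    float* array1 = (float*)&array0[128];  // 64 floats, byte offset 256 (4-byte aligned)
    int*   array2 = (int*)&array1[64];     // 256 ints, byte offset 512 (4-byte aligned)

    int t = threadIdx.x;
    array0[t] = (short)t;
    array1[t] = (float)t;
    array2[t] = t;
    __syncthreads();   // make every thread's writes visible within the block

    // Thread 0 reads values written by the last thread of the block.
    if (t == 0) {
        int last = blockDim.x - 1;
        *out = array0[last] + (int)array1[last] + array2[last];
    }
}

// Hypothetical host helper: returns the value computed by thread 0, or -1 on error.
int run_shared_layout_demo()
{
    // Total dynamic shared memory: 256 + 256 + 1024 = 1536 bytes.
    size_t bytes = 128 * sizeof(short) + 64 * sizeof(float) + 256 * sizeof(int);

    int* d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;

    partition_demo<<<1, 64, bytes>>>(d_out);

    int h_out = 0;
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out : -1;   // 63 + 63 + 63
}
```

With 64 threads per block, run_shared_layout_demo() should return 189. Ordering the sub-arrays so that each one starts at a multiple of its element size is what keeps the casts valid.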

4. __managed__

The __managed__ memory space specifier, optionally used together with __device__, declares a variable that:

Can be referenced from both device and host code; for example, its address can be taken, and it can be read or written directly from a device or host function.
Has the lifetime of the application.
For more details, see __managed__ Memory Space Specifier.
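The following is a minimal sketch of direct host and device access to a managed variable. The names m_total, add_thread_ids and run_managed_demo() are hypothetical, and the example assumes a GPU and platform that support managed memory. The host synchronizes before reading the variable again, which is the conservative choice on devices that do not support concurrent managed access.

```cuda
#include <cuda_runtime.h>

// Hypothetical managed variable, visible by name to both host and device code.
__managed__ int m_total = 0;

// Each thread adds its index to the managed variable.
__global__ void add_thread_ids()
{
    atomicAdd(&m_total, (int)threadIdx.x);
}

// Hypothetical host helper: returns the final value, or -1 on error.
int run_managed_demo()
{
    m_total = 10;                  // direct host write, no cudaMemcpy needed
    add_thread_ids<<<1, 32>>>();   // adds 0 + 1 + ... + 31 = 496

    // Wait for the kernel before touching the variable from the host again.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;
    return m_total;                // direct host read
}
```

Under those assumptions, run_managed_demo() should return 506.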

5. __restrict__

nvcc supports restricted pointers via the __restrict__ keyword.

Restricted pointers were introduced in C99 to alleviate the aliasing problem that exists in C-type languages, which inhibits all kinds of optimizations, from code reordering to common subexpression elimination.

The following example is subject to the aliasing problem; using restricted pointers can help the compiler reduce the number of instructions:

void foo(const float* a,
         const float* b,
         float* c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
    ...
}
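Without further information the compiler has to assume that a write through c could change a[0], a[1] or b[0], so it cannot safely reuse the loaded values or the product a[0] * b[0]. As a sketch of the restricted variant, the kernel below qualifies all three pointers with __restrict__, which is a promise from the programmer that they do not alias. The kernel name foo_restrict, the helper run_restrict_demo() and the input values are hypothetical, and whether the compiler actually performs the reuse depends on the compiler and its optimization settings.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: __restrict__ tells the compiler that a, b and c do not alias.
__global__ void foo_restrict(const float* __restrict__ a,
                             const float* __restrict__ b,
                             float* __restrict__ c)
{
    // The compiler is now free to load a[0], a[1], b[0] once and reuse a[0] * b[0].
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
}

// Hypothetical host helper: returns the sum of the six outputs, or -1 on error.
float run_restrict_demo()
{
    const float h_a[2] = {2.0f, 3.0f};
    const float h_b[1] = {5.0f};
    float h_c[6] = {0};

    float* d_a = nullptr;
    float* d_b = nullptr;
    float* d_c = nullptr;
    if (cudaMalloc(&d_a, sizeof(h_a)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_b, sizeof(h_b)) != cudaSuccess) { cudaFree(d_a); return -1.0f; }
    if (cudaMalloc(&d_c, sizeof(h_c)) != cudaSuccess) { cudaFree(d_a); cudaFree(d_b); return -1.0f; }

    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(h_b), cudaMemcpyHostToDevice);
    foo_restrict<<<1, 1>>>(d_a, d_b, d_c);
    cudaError_t err = cudaMemcpy(h_c, d_c, sizeof(h_c), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    if (err != cudaSuccess) return -1.0f;

    float sum = 0.0f;
    for (int i = 0; i < 6; ++i) sum += h_c[i];   // 10 + 10 + 30 + 6 + 10 + 5
    return sum;
}
```

For these inputs run_restrict_demo() should return 71.0f. The results are the same with or without __restrict__; only the generated instruction sequence may differ.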

Marking the pointers as restricted reduces the number of memory accesses and the number of computations. This gain is offset by increased register pressure, because the "cached" loads and common subexpressions have to be kept in registers.

Since register pressure is a critical issue in many CUDA codes, using restricted pointers can have a negative performance impact on CUDA code because of reduced occupancy.

Copyright notice
This article was written by [Little Heshang sweeping the floor]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/183/202207020612427507.html