Storage space modifier in CUDA
2022-07-02 06:28:00 【Little Heshang sweeping the floor】
Variable Memory Space Specifiers
Variable memory space specifiers denote the memory location of a variable on the device.
An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ memory space specifiers described in this section generally resides in a register. In some cases, however, the compiler may choose to place it in local memory, which can have adverse performance consequences, as described in Device Memory Accesses.
1. __device__
The __device__ memory space specifier declares a variable that resides on the device.
At most one of the memory space specifiers defined in the next three sections may be used together with __device__ to further denote which memory space the variable belongs to. If none of them is present, the variable:
- Resides in the global memory space,
- Has the lifetime of the CUDA context in which it is created,
- Has a distinct object per device,
- Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
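As a sketch of these properties (the variable and kernel names are illustrative, not from the source), a global-scope __device__ variable is visible to every thread in the grid, while the host reaches it through the symbol API:

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: a grid-wide counter residing in global memory.
__device__ int d_counter;   // one object per device, context lifetime

__global__ void incrementCounter()
{
    // Every thread in the grid accesses the same object.
    atomicAdd(&d_counter, 1);
}

int main()
{
    int zero = 0, result = 0;
    // The host cannot dereference d_counter directly;
    // it uses the symbol API instead.
    cudaMemcpyToSymbol(d_counter, &zero, sizeof(int));
    incrementCounter<<<4, 64>>>();
    cudaMemcpyFromSymbol(&result, d_counter, sizeof(int));
    // result now holds the number of launched threads (4 * 64).
    return 0;
}
```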
2. __constant__
The __constant__ memory space specifier, optionally used together with __device__, declares a variable that:
- Resides in the constant memory space,
- Has the lifetime of the CUDA context in which it is created,
- Has a distinct object per device,
- Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
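A minimal sketch of typical __constant__ usage (the coefficient array and kernel are illustrative assumptions): device code reads the variable like any other, while the host must write it through the symbol API.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: filter coefficients kept in constant memory.
__constant__ float c_coeffs[4];

__global__ void applyCoeffs(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Reads are served by the constant cache; they are fastest
        // when all threads of a warp read the same element.
        out[i] = in[i] * c_coeffs[i % 4];
}

// Host side: constant memory cannot be assigned to directly;
// it is written through cudaMemcpyToSymbol().
void setCoeffs(const float host_coeffs[4])
{
    cudaMemcpyToSymbol(c_coeffs, host_coeffs, 4 * sizeof(float));
}
```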
3. __shared__
The __shared__ memory space specifier, optionally used together with __device__, declares a variable that:
- Resides in the shared memory space of a thread block,
- Has the lifetime of the block,
- Has a distinct object per block,
- Is only accessible from all the threads within the block,
- Does not have a fixed address.
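These properties can be sketched with a statically sized shared array (the kernel and the block size of 256 are illustrative assumptions, not from the source):

```cuda
// Hypothetical sketch: a per-block __shared__ tile used for a
// block-level sum reduction. Assumes blockDim.x == 256.
__global__ void blockSum(const float* in, float* blockSums)
{
    __shared__ float tile[256];   // one copy per block, block lifetime

    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();              // make the tile visible block-wide

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = tile[0];
}
```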
When declaring variables in shared memory as external arrays , for example :
extern __shared__ float shared[];
the size of the array is determined at launch time (see Execution Configuration). All variables declared in this way start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets. For example, if one wants the equivalent of
short array0[128];
float array1[64];
int array2[256];
in dynamically allocated shared memory, the arrays could be declared and initialized as follows:
extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 = (int*)&array1[64];
}
Note that pointers need to be aligned to the type they point to, so the following code does not work, since array1 is not aligned to 4 bytes.
extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[127];
}
Table 4 lists the alignment requirements of the built-in vector types.
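The size of such an extern array is supplied as the third execution-configuration parameter at launch. A minimal sketch, assuming a __global__ kernel that partitions the array as shown above (the kernel name and launch dimensions are illustrative):

```cuda
#include <cuda_runtime.h>

extern __shared__ float array[];

__global__ void func()
{
    // Partition the dynamic shared memory by explicit offsets.
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 = (int*)&array1[64];
    if (threadIdx.x == 0) array2[0] = 0;   // e.g. touch the last segment
}

int main()
{
    // Reserve enough dynamic shared memory for all three arrays:
    // 128 shorts + 64 floats + 256 ints.
    size_t smem = 128 * sizeof(short) + 64 * sizeof(float)
                + 256 * sizeof(int);
    func<<<1, 128, smem>>>();   // third parameter sets the array size
    cudaDeviceSynchronize();
    return 0;
}
```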
4. __managed__
The __managed__ memory space specifier, optionally used together with __device__, declares a variable that:
- Can be referenced from both device and host code, e.g., its address can be taken, and it can be read or written directly from a device or host function,
- Has the lifetime of the application.
See __managed__ Memory Space Specifier for more details.
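A sketch of what "referenced from device and host code" means in practice (the variable and kernel names are illustrative): unlike a plain __device__ variable, no cudaMemcpyToSymbol() is needed.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: a __managed__ variable is directly usable
// from both host and device code.
__managed__ int m_value = 0;

__global__ void addOne()
{
    m_value += 1;             // device code writes it like a normal variable
}

int main()
{
    m_value = 41;             // host code writes it directly, no copy API
    addOne<<<1, 1>>>();
    cudaDeviceSynchronize();  // synchronize before the host reads it again
    // m_value is now 42
    return 0;
}
```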
5. __restrict__
nvcc supports restricted pointers via the __restrict__ keyword.
Restricted pointers were introduced in C99 to alleviate the aliasing problem that exists in C-type languages, which inhibits all kinds of optimizations, from code reordering to common sub-expression elimination.
The following is an example subject to the aliasing problem, where the use of restricted pointers can help the compiler reduce the number of instructions:
void foo(const float* a,
         const float* b,
         float* c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
    ...
}
The effects here are a reduced number of memory accesses and a reduced number of computations, balanced by an increase in register pressure due to the "cached" loads and common sub-expressions.
Since register pressure is a critical issue in many CUDA codes, the use of restricted pointers can therefore have a negative performance impact on CUDA code, due to reduced occupancy.
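To illustrate the optimization being discussed, here is a sketch of the same function with restricted pointers and the loads hoisted by hand, the way a compiler may transform it once the pointers are guaranteed not to alias (the temporary names are illustrative):

```cuda
// With __restrict__ on all three pointers, the compiler may assume
// writes through c never modify what a and b point to, so the common
// loads and sub-expressions can be kept in registers.
void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c)
{
    float t0 = a[0];
    float t1 = b[0];
    float t2 = t0 * t1;       // a[0] * b[0], computed once
    float t3 = t0 * a[1];     // a[0] * a[1], computed once
    c[0] = t2;
    c[1] = t2;
    c[2] = t2 * a[1];
    c[3] = t3;
    c[4] = t2;
    c[5] = t1;
}
```

Without __restrict__, the compiler must assume each store through c may change a[0] or b[0] and reload them, which is what inflates the instruction count in the original version.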