Memory Space Specifiers in CUDA
2022-07-02 06:28:00 【Little Heshang sweeping the floor】
Variable Memory Space Specifiers
A variable memory space specifier denotes the memory location of a variable on the device.
An automatic variable declared in device code without any of the __device__, __shared__, and __constant__ memory space specifiers described in this section generally resides in a register. In some cases, however, the compiler may choose to place it in local memory, which can have adverse performance consequences, as described in Device Memory Accesses.
1. __device__
The __device__ memory space specifier declares a variable that resides on the device.
At most one of the other memory space specifiers defined in the next three sections may be used together with __device__ to further denote which memory space the variable belongs to. If none of them is present, the variable:
- Resides in the global memory space,
- Has the lifetime of the CUDA context in which it is created,
- Has a distinct object per device,
- Is accessible from all the threads within the grid, and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
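As a minimal sketch of this access pattern (the variable name, size, and kernel here are illustrative, not from the original text), a __device__ variable can be written from the host through the symbol API and read or written by any thread in the grid:

#include <cuda_runtime.h>

__device__ float devData[256];            // global memory space, one object per device

__global__ void scale(float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 256)
        devData[i] *= factor;             // every thread in the grid can access it
}

int main() {
    float host[256] = {0};
    cudaMemcpyToSymbol(devData, host, sizeof(host));    // host writes through the symbol API
    scale<<<1, 256>>>(2.0f);
    cudaMemcpyFromSymbol(host, devData, sizeof(host));  // host reads it back
    return 0;
}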
2. __constant__
The __constant__ memory space specifier, optionally used together with __device__, declares a variable that:
- Resides in the constant memory space,
- Has the lifetime of the CUDA context in which it is created,
- Has a distinct object per device,
- Is accessible from all the threads within the grid, and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
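A hedged sketch of typical __constant__ usage (the names and the polynomial are illustrative): the host fills the constant variable once with cudaMemcpyToSymbol(), and every thread then reads the same values, which constant memory serves efficiently:

#include <cuda_runtime.h>

__constant__ float coeffs[4];             // constant memory space, read-only in device code

__global__ void poly(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                            // all threads read the same coeffs values
        y[i] = coeffs[0] + x[i] * (coeffs[1] + x[i] * (coeffs[2] + x[i] * coeffs[3]));
}

int main() {
    float h[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(coeffs, h, sizeof(h));  // constant memory is written from the host
    return 0;                                  // kernel launch omitted in this sketch
}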
3. __shared__
The __shared__ memory space specifier, optionally used together with __device__, declares a variable that:
- Resides in the shared memory space of a thread block,
- Has the lifetime of the block,
- Has a distinct object per block,
- Is only accessible from all the threads within the block,
- Does not have a constant address.
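To make these properties concrete, here is a minimal sketch with a statically sized shared array (the kernel name and the fixed block size of 256 are illustrative assumptions):

__global__ void reverse(float* data) {
    __shared__ float tile[256];          // one object per block, block lifetime
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();                     // writes must complete before other threads read
    data[i] = tile[blockDim.x - 1 - threadIdx.x];  // only this block's threads see tile
}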
When declaring a variable in shared memory as an external array, for example:
extern __shared__ float shared[];
the size of the array is determined at launch time (see Execution Configuration). All variables declared in this way start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets. For example, if one wants the equivalent of
short array0[128];
float array1[64];
int array2[256];
in dynamically allocated shared memory, one could declare and initialize the arrays the following way:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;          // 128 shorts starting at offset 0
    float* array1 = (float*)&array0[128];   // 64 floats after the 128 shorts
    int*   array2 = (int*)&array1[64];      // 256 ints after the 64 floats
}
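The per-block size in bytes of the extern array is supplied as the third argument of the execution configuration mentioned above; a hedged sketch of a matching launch (the kernel name and grid/block dimensions are illustrative):

// Enough bytes for the layout above: 128 shorts + 64 floats + 256 ints
size_t nbytes = 128 * sizeof(short) + 64 * sizeof(float) + 256 * sizeof(int);
kernel<<<grid, block, nbytes>>>();  // third <<<...>>> argument sizes the extern shared array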
Note that pointers need to be aligned to the type they point to, so the following code does not work, since array1 is not aligned to 4 bytes:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[127];   // 127 * sizeof(short) = 254 bytes: not 4-byte aligned
}
Table 4 lists the alignment requirements of the built-in vector types.
4. __managed__
The __managed__ memory space specifier, optionally used together with __device__, declares a variable that:
- Can be referenced from both device and host code, e.g., its address can be taken, and it can be read or written directly from a device or host function,
- Has the lifetime of the application.
See the __managed__ Memory Space Specifier section for more details.
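A minimal sketch of a __managed__ variable (the names are illustrative): both sides access it directly, with a synchronization before the host reads what the device wrote:

#include <cuda_runtime.h>

__managed__ int counter;       // addressable from host and device, application lifetime

__global__ void bump() {
    atomicAdd(&counter, 1);    // device code writes it like an ordinary global
}

int main() {
    counter = 0;               // host code reads and writes it without explicit copies
    bump<<<4, 64>>>();
    cudaDeviceSynchronize();   // synchronize before the host reads the device's writes
    // counter now holds 4 * 64
    return 0;
}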
5. __restrict__
nvcc supports restricted pointers via the __restrict__ keyword.
Restricted pointers were introduced in C99 to alleviate the aliasing problem that exists in C-type languages, a problem that inhibits all kinds of optimizations, from code re-ordering to common sub-expression elimination.
Here is an example subject to the aliasing issue, where use of restricted pointers can help the compiler to reduce the number of instructions:
void foo(const float* a,
         const float* b,
         float* c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
    ...
}
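By qualifying all three pointers with __restrict__, the programmer asserts that they do not alias, and the compiler becomes free to reorder loads and reuse common sub-expressions. A hedged sketch of what the transformed code can effectively look like (the temporaries are illustrative; the trailing elision mirrors the example above):

void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c)
{
    float t0 = a[0];
    float t1 = b[0];
    float t2 = t0 * t1;   // a[0] * b[0] is loaded and computed only once
    float t3 = a[1];
    c[0] = t2;
    c[1] = t2;
    c[2] = t2 * t3;
    c[3] = t0 * t3;
    c[4] = t2;
    c[5] = t1;
    ...
}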
The effects here are a reduced number of memory accesses and a reduced number of computations, balanced by an increase in register pressure due to the "cached" loads and common sub-expressions.
Since register pressure is a critical issue in many CUDA codes, use of restricted pointers can therefore have a negative performance impact on CUDA code, due to reduced occupancy.