Thread hierarchy in CUDA
2022-07-02 06:28:00 【Little Heshang sweeping the floor】
Thread hierarchy
For convenience, threadIdx is a 3-component vector, so threads can be identified using a one-, two-, or three-dimensional thread index, forming a one-, two-, or three-dimensional block of threads, called a thread block. This provides a natural way to invoke computation across the elements of a domain such as a vector, matrix, or volume.
The index of a thread and its thread ID relate to each other in a straightforward way: for a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of the thread of index (x, y) is (x + y*Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of the thread of index (x, y, z) is (x + y*Dx + z*Dx*Dy).
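To make this mapping concrete, the following minimal sketch (an addition to the text; the kernel name flatThreadId is assumed purely for illustration) computes the linear thread ID from a three-dimensional thread index using the formula above:
// Illustrative sketch: flatten a 3D thread index into a linear thread ID.
// Here blockDim.x plays the role of Dx and blockDim.y the role of Dy.
__global__ void flatThreadId(int* out)
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    out[tid] = tid;  // each thread writes its own linear ID
}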
For example, the following code adds two matrices A and B of size NxN and stores the result in matrix C:
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
However, a kernel can be executed by multiple equally-shaped thread blocks, so the total number of threads is equal to the number of threads per block times the number of blocks.
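These limits need not be hard-coded: they can be queried at runtime through the standard cudaGetDeviceProperties API, as in this host-side sketch (an addition to the text):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // Upper bound on threads per block (1024 on current GPUs)
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    // Per-dimension limits for blocks and grids
    printf("Max block dims: %d x %d x %d\n", prop.maxThreadsDim[0],
           prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n", prop.maxGridSize[0],
           prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}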
Blocks are organized into a one-, two-, or three-dimensional grid of thread blocks, as illustrated in the figure below. The number of thread blocks in a grid is usually dictated by the size of the data being processed, which typically exceeds the number of processors in the system.
[Figure: a grid of thread blocks]
The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.
Each block within the grid can be identified by a one-, two-, or three-dimensional unique index, accessible within the kernel through the built-in blockIdx variable. The dimensions of the thread block are accessible within the kernel through the built-in blockDim variable.
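For the common one-dimensional case, both launch parameters can be plain int values; the sketch below (the kernel name VecAdd and its parameters are illustrative additions, not from the original text) shows how blockIdx and blockDim combine to form a global index:
// Illustrative sketch: 1D grid of 1D blocks, launched with int parameters.
__global__ void VecAdd(const float* A, const float* B, float* C, int n)
{
    // Global index = block offset + thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard the last, possibly partial block
        C[i] = A[i] + B[i];
}

// Launch with plain int arguments instead of dim3:
// int threadsPerBlock = 256;
// int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
// VecAdd<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, n);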
Extending the previous MatAdd() example to handle multiple blocks, the code becomes the following.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}
A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. The grid is created with enough blocks to have one thread per matrix element. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.
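When N is not a multiple of the block size, the usual remedy is to round the block count up and rely on the bounds check inside the kernel. A minimal sketch of that launch configuration (an addition to the original example):
// Round the block count up so the grid covers all N elements even
// when N is not a multiple of 16; the if (i < N && j < N) guard in
// MatAdd discards the surplus threads of the last blocks.
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);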
Thread blocks are required to execute independently: it must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores, as illustrated in the figure below, enabling programmers to write code that scales with the number of cores.
[Figure: thread blocks scheduled independently across any number of cores]
Threads within a block can cooperate by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory. In addition to __syncthreads(), the Cooperative Groups API provides a rich set of thread-synchronization primitives.
For efficient cooperation, shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache), and __syncthreads() is expected to be lightweight.
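As a small illustration of both features, the sketch below (illustrative, not from the original text; the kernel name reverseBlock is assumed) stages data in shared memory and uses __syncthreads() as a barrier between the write phase and the read phase:
#define BLOCK_SIZE 256

// Illustrative sketch: reverse the elements owned by one thread block.
// Each thread writes one element into shared memory, the whole block
// synchronizes, then each thread reads back the mirrored element.
// Launch with exactly BLOCK_SIZE threads per block.
__global__ void reverseBlock(float* data)
{
    __shared__ float tile[BLOCK_SIZE];

    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * blockDim.x + t];

    __syncthreads();  // barrier: every write to tile must complete first

    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}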