Warp shuffle in CUDA
2022-07-02 06:27:00 【Little Heshang sweeping the floor】
Warp Shuffle in CUDA
Warp Shuffle Functions
__shfl_sync, __shfl_up_sync, __shfl_down_sync, and __shfl_xor_sync
exchange a variable between threads within a warp.
Supported by devices of compute capability 3.x or higher.
Deprecation notice: __shfl, __shfl_up, __shfl_down, and __shfl_xor have been deprecated in CUDA 9.0 for all devices.
Removal notice: when targeting devices of compute capability 7.x or higher, __shfl, __shfl_up, __shfl_down, and __shfl_xor are no longer available; their sync variants must be used instead.
Author's note: the threadIndex, warpIdx, and laneIndex mentioned below are easy to confuse, so I will use the following figure (and the short sketch below) to illustrate.
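As a minimal sketch of my own (not part of the original figure), the indices involved can be computed directly from threadIdx; warpId and laneId below are helper names introduced only for illustration.

#include <stdio.h>

// Sketch: how threadIdx relates to the warp index and the lane index.
__global__ void indexDemo() {
    int tid    = threadIdx.x;
    int warpId = tid / warpSize;   // which warp within the block
    int laneId = tid % warpSize;   // which lane within that warp (0..warpSize-1)
    printf("thread %d -> warp %d, lane %d\n", tid, warpId, laneId);
}

int main() {
    indexDemo<<< 1, 64 >>>();      // launches two warps of 32 threads
    cudaDeviceSynchronize();
    return 0;
}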
1. Synopsis
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float, or double. After including the cuda_fp16.h header, T can also be __half or __half2. Similarly, after including the cuda_bf16.h header, T can also be __nv_bfloat16 or __nv_bfloat162.
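As a small sketch of my own (assuming a GPU and toolkit recent enough to support the __half type from cuda_fp16.h), a half-precision value can be shuffled just like the built-in types:

#include <stdio.h>
#include <cuda_fp16.h>

// Sketch: broadcasting a __half value from lane 0 across the warp.
__global__ void halfBcast() {
    int laneId = threadIdx.x & 0x1f;
    __half h = __float2half((float)laneId);    // each lane starts with its own value
    h = __shfl_sync(0xffffffff, h, 0);         // T = __half once cuda_fp16.h is included
    printf("Thread %d received %f\n", threadIdx.x, __half2float(h));
}

int main() {
    halfBcast<<< 1, 32 >>>();
    cudaDeviceSynchronize();
    return 0;
}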
2. Description
The __shfl_sync() intrinsics allow exchanging a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp (and named in mask), moving 4 or 8 bytes of data per thread depending on the type.
Threads within a warp are referred to as lanes and may have an index between 0 and warpSize-1 (inclusive). Four source-lane addressing modes are supported (a compact sketch of all four follows the list below):
__shfl_sync(): direct copy from the indexed lane
__shfl_up_sync(): copy from a lane with a lower ID relative to the caller
__shfl_down_sync(): copy from a lane with a higher ID relative to the caller
__shfl_xor_sync(): copy from a lane based on a bitwise XOR of the caller's own lane ID
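The following compact sketch (my own, not from the original guide) gives every lane its own lane ID as the starting value and applies each intrinsic once across a full warp:

#include <stdio.h>

// Sketch: the four source-lane addressing modes side by side.
__global__ void shuffleModes() {
    int laneId = threadIdx.x & 0x1f;
    int v = laneId;
    int fromLane3 = __shfl_sync(0xffffffff, v, 3);       // every lane reads lane 3
    int fromBelow = __shfl_up_sync(0xffffffff, v, 1);    // lane i reads lane i-1; lane 0 keeps its own value
    int fromAbove = __shfl_down_sync(0xffffffff, v, 1);  // lane i reads lane i+1; lane 31 keeps its own value
    int fromXor   = __shfl_xor_sync(0xffffffff, v, 1);   // lane i reads lane i^1; adjacent pairs swap
    printf("lane %2d: shfl=%2d up=%2d down=%2d xor=%2d\n",
           laneId, fromLane3, fromBelow, fromAbove, fromXor);
}

int main() {
    shuffleModes<<< 1, 32 >>>();
    cudaDeviceSynchronize();
    return 0;
}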
Threads may only read data from another thread that is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.
All of the __shfl_sync() intrinsics take an optional width parameter that alters the behavior of the intrinsic. width must be a power of 2; if width is not a power of 2, or is greater than warpSize, the result is undefined.
__shfl_sync() returns the value of var held by the thread whose ID is given by srcLane. If width is less than warpSize, each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the lane srcLane modulo width (i.e. within the same subsection).
Author's note: the original wording here is a little convoluted, so I will illustrate with pictures instead. Note that the following four figures were made by me; if anything is wrong, that simply reflects my own level -_-!.
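A small sketch of my own makes the sub-partition rule concrete: with width = 8 the warp acts as four independent 8-lane groups, and srcLane = 10 behaves like 10 % 8 = 2 within each group.

#include <stdio.h>

// Sketch: __shfl_sync with width < warpSize and an out-of-range srcLane.
__global__ void subPartitionBcast() {
    int laneId = threadIdx.x & 0x1f;
    int value  = laneId;
    int got = __shfl_sync(0xffffffff, value, 10, 8);   // 10 % 8 == 2 within each 8-lane group
    printf("lane %2d got %2d\n", laneId, got);         // lanes 0-7 print 2, lanes 8-15 print 10, ...
}

int main() {
    subPartitionBcast<<< 1, 32 >>>();
    cudaDeviceSynchronize();
    return 0;
}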
__shfl_up_sync() calculates a source lane ID by subtracting delta from the caller's lane ID. The value of var held by the resulting lane ID is returned: in effect, var is shifted up the warp by delta lanes. If width is less than warpSize, each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. The source lane index does not wrap around the value of width, so in effect the lower delta lanes are unchanged.
__shfl_down_sync() calculates a source lane ID by adding delta to the caller's lane ID. The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If width is less than warpSize, each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. As with __shfl_up_sync(), the source lane ID does not wrap around the value of width, so the upper delta lanes remain unchanged.
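The examples in section 4 below cover __shfl_sync, __shfl_up_sync, and __shfl_xor_sync but not __shfl_down_sync, so here is a sketch of my own of the common warp-sum idiom built on it; after the loop only lane 0 holds the complete sum.

#include <stdio.h>

// Sketch: warp-level sum using __shfl_down_sync.
// Values do not wrap around, so upper lanes end up with partial sums;
// lane 0 accumulates the full total.
__global__ void warpSumDown() {
    int laneId = threadIdx.x & 0x1f;
    int value = 31 - laneId;
    for (int offset = 16; offset >= 1; offset /= 2)
        value += __shfl_down_sync(0xffffffff, value, offset, 32);
    if (laneId == 0)
        printf("Warp sum = %d\n", value);   // 0 + 1 + ... + 31 = 496
}

int main() {
    warpSumDown<<< 1, 32 >>>();
    cudaDeviceSynchronize();
    return 0;
}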
__shfl_xor_sync() calculates a source lane ID by performing a bitwise XOR of the caller's lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize, each group of width consecutive threads can access elements from earlier groups of threads, but if they attempt to access elements from later groups of threads their own value of var is returned. This mode implements a butterfly addressing pattern, as used for example in tree reduction and broadcast.
The new *_sync shfl intrinsics take a mask indicating the threads participating in the call. A bit representing the thread's lane ID must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. All non-exited threads named in the mask must execute the same intrinsic with the same mask, or the result is undefined.
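When only part of the warp participates (for example inside a divergent branch), the mask must name exactly the lanes that will execute the call. Below is a sketch of my own using the standard __ballot_sync intrinsic to build such a mask:

#include <stdio.h>

// Sketch: building a participation mask for a divergent shuffle.
// __ballot_sync is executed by the full warp and records which lanes
// satisfy the predicate; that mask then names the participants.
__global__ void maskedShuffle() {
    int laneId = threadIdx.x & 0x1f;
    bool participates = (laneId % 2 == 0);                  // only even lanes take part
    unsigned mask = __ballot_sync(0xffffffff, participates);
    if (participates) {
        // Every lane named in 'mask' executes the same call with the same mask.
        int v = __shfl_sync(mask, laneId, 0);               // even lanes read lane 0's value
        printf("lane %2d read %d\n", laneId, v);
    }
}

int main() {
    maskedShuffle<<< 1, 32 >>>();
    cudaDeviceSynchronize();
    return 0;
}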
3. Notes
Threads may only read data from another thread that is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.
The width must be a power of 2 (i.e. 2, 4, 8, 16, or 32). Results are undefined for any other value.
4. Examples
4.1. Broadcast of a single value across a warp
#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)        // Note unused variable for
        value = arg;        // all threads except lane 0
    value = __shfl_sync(0xffffffff, value, 0);   // Synchronize all threads in warp, and get "value" from lane 0
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();
    return 0;
}
4.2. Inclusive plus-scan across sub-partitions of 8 threads
#include <stdio.h>

__global__ void scan4() {
    int laneId = threadIdx.x & 0x1f;
    // Seed sample starting value (inverse of lane ID)
    int value = 31 - laneId;

    // Loop to accumulate scan within my partition.
    // Scan requires log2(n) == 3 steps for 8 threads
    // It works by an accumulated sum up the warp
    // by 1, 2, 4, 8 etc. steps.
    for (int i = 1; i <= 4; i *= 2) {
        // We do the __shfl_sync unconditionally so that we
        // can read even from threads which won't do a
        // sum, and then conditionally assign the result.
        int n = __shfl_up_sync(0xffffffff, value, i, 8);
        if ((laneId & 7) >= i)
            value += n;
    }

    printf("Thread %d final value = %d\n", threadIdx.x, value);
}

int main() {
    scan4<<< 1, 32 >>>();
    cudaDeviceSynchronize();
    return 0;
}
4.3. Reduction across a warp
#include <stdio.h>

__global__ void warpReduce() {
    int laneId = threadIdx.x & 0x1f;
    // Seed starting value as inverse lane ID
    int value = 31 - laneId;

    // Use XOR mode to perform butterfly reduction
    for (int i = 16; i >= 1; i /= 2)
        value += __shfl_xor_sync(0xffffffff, value, i, 32);

    // "value" now contains the sum across all threads
    printf("Thread %d final value = %d\n", threadIdx.x, value);
}

int main() {
    warpReduce<<< 1, 32 >>>();
    cudaDeviceSynchronize();
    return 0;
}