当前位置：网站首页>Warp matrix functions in CUDA

Warp matrix functions in CUDA

2022-07-02 06:28:00 【Little Heshang sweeping the floor】

Warp matrix functions

C++ warp Matrix operation utilization Tensor Cores To speed up D=A*B+C Formal matrix problem . Ability to calculate 7.0 Mixed precision floating-point data of devices with or higher versions supports these operations . It needs a warp Cooperation of all threads in . Besides , Only if the condition is in the whole warp When the calculation results in are the same , It is allowed to perform these operations in the condition code , Otherwise, code execution may hang .

1. Description

All the following functions and types are in the namespace nvcuda::wmma In the definition of . Sub-byte The operation is considered a preview , That is, their data structure and API May change , And may be incompatible with future versions . This extra function is in nvcuda::wmma::experimental In the namespace .

template<typename Use, int m, int n, int k, typename T, typename Layout=void> class fragment;

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm);
void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm, layout_t layout);
void store_matrix_sync(T* mptr, const fragment<...> &a, unsigned ldm, layout_t layout);
void fill_fragment(fragment<...> &a, const T& v);
void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool satf=false);

fragment:

Overloaded classes that contain part of the matrix , Distribution in warp Of all threads in . Matrix elements to fragment The mapping of internal storage is unspecified , And it may change in the future architecture .

Only certain combinations of template parameters are allowed . The first template parameter specifies how the fragment will participate in matrix operations . Acceptable usage values are ：

matrix_a When fragment When used as the first multiplicand ,A
matrix_b When fragment When used as the second multiplicand ,B
When fragment Used as a source or destination accumulator （ Respectively C or D） Time accumulator .

m、n and k Size describes the warp-wide Shape of matrix block . Every tile The size of depends on its function . about matrix_a, The size of the block is m x k; about matrix_b, Dimension is k x n, The accumulator block is m x n.

For the multiplicand , data type T It can be double、float、__half、__nv_bfloat16、char or unsigned char, For accumulators , It can be double、float、int or __half. Such as Element type and matrix size Described in , Support limited combinations of accumulator and multiplicand types . It has to be for matrix_a and matrix_b Fragment assignment Layout Parameters . row_major or col_major They represent the matrix Row or column The elements in are continuous in memory . Of the accumulator matrix Layout The parameter should keep the default value void. Specify row or column layout only when loading or storing accumulators as described below .

load_matrix_sync:

Wait until all warp passageway (lanes) All arrived load_matrix_sync, Then load the matrix fragment from memory a. mptr Must be a 256 Bit aligned pointer , Points to the first element of the matrix in memory . ldm Describe consecutive lines （ For row main sequence ） Or column （ For column main order ） Element span between , about __half The element type must be 8 Multiple , For floating-point elements, the type must be 4 Multiple . （ namely , In both cases 16 Multiple of bytes ）. If fragment It's an accumulator , Then the layout parameter must be specified as mem_row_major or mem_col_major. about matrix_a and matrix_b fragment ,Layout It's from fragment Of Layout Inferred from the parameter . a Of mptr、ldm、layout And the values of all template parameters are for warp All threads in must be the same . This function must be warp All threads in call , Otherwise, the result is undefined .

store_matrix_sync:

Wait until all warp All channels arrive store_matrix_sync, Then the matrix fragment a Store in memory . mptr Must be a 256 Bit aligned pointer , Points to the first element of the matrix in memory . ldm Describe consecutive lines （ For row main sequence ） Or column （ For column main order ） Element span between , about __half The element type must be 8 Multiple , For floating-point elements, the type must be 4 Multiple . （ namely , In both cases 16 Multiple of bytes ）. The layout of the output matrix must be specified as mem_row_major or mem_col_major. a Of mptr、ldm、layout And the values of all template parameters are for warp All threads in must be the same .

fill_fragment:

Use constant v Fill matrix fragment . Since the mapping of matrix elements to each fragment is not specified , Therefore, this function is usually composed of warp All threads in call , And have common v value .

mma_sync:

Wait until all warp lanes All arrived mma_sync, And then execute warp Synchronous matrix multiplication and accumulation operation D=A*B+C. It also supports in situ (in-place) operation ,C=A*B+C. about warp All threads in , Of each matrix segment satf The values of and template parameters must be the same . Besides , Template parameter m、n and k Must be in segment A、B、C and D The matching between . This function must be defined by warp All threads in call , Otherwise the result is undefined .

If satf（ Saturated to a finite value –saturate to finite value） The mode is true , Then the following additional numeric attributes apply to the target accumulator ：

If the element result is +Infinity, Then the corresponding accumulator will contain +MAX_NORM
If the element result is -Infinity, Then the corresponding accumulator will contain -MAX_NORM
If the element result is NaN, Then the corresponding accumulator will contain +0

Since the mapping of matrix elements to each thread fragment is not specified , Therefore, it must be called in store_matrix_sync From memory （ Shared or global ） Access a single matrix element . stay warp In the special case that all threads in will uniformly apply element operations to all fragment elements , You can use the following fragment Class members implement direct element access .

enum fragment<Use, m, n, k, T, Layout>::num_elements;
T fragment<Use, m, n, k, T, Layout>::x[num_elements];

for example , The following code reduces the accumulator matrix by half .

wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag;
float alpha = 0.5f; // Same value for all threads in warp
/*...*/
for(int t=0; t<frag.num_elements; t++)
frag.x[t] *= alpha;

2. Alternate Floating Point

Tensor Core Support in having 8.0 And higher computing power devices for alternative types of floating-point operations .

__nv_bfloat16:

This data format is another fp16 Format , Its scope is related to f32 identical , But the accuracy is reduced （7 position ）. You can directly compare this data format with cuda_bf16.h Provided in __nv_bfloat16 Type used together . have __nv_bfloat16 Matrix fragments of data type need to be combined with accumulators of floating-point type . Supported shapes and operations are similar to __half identical .

tf32:

This data format is Tensor Cores Special floating point formats supported , The scope and f32 identical , But the accuracy is reduced （>=10 position ）. The internal layout of this format is implementation defined . In order to be in WMMA This floating-point format is used in the operation , The input matrix must be manually converted to tf32 precision .

To facilitate conversion , Provides a new inline function __float_to_tf32. Although the input and output parameters of the inline function are floating-point types , But the output will be tf32. This new precision applies only to tensor cores , If mixed with other floating-point operations , The precision and scope of the results will be undefined .

Once the matrix is entered （matrix_a or matrix_b） Converted to tf32 precision , have precision::tf32 Precision segments and load_matrix_sync Of float The combination of data types will take advantage of this new feature . Both accumulator fragments must have a floating-point data type . The only supported matrix size is 16x16x8 (m-n-k).

The elements of the fragment are represented as floating-point numbers , So from element_type<T> To storage_element_type<T> The mapping is ：

precision::tf32 -> float

3. Double Precision

Tensor Core Support computing power 8.0 And higher devices . To use this new feature , Must use with double Fragment of type . mma_sync Operation will use .rn（ Round to the nearest even number ） The rounding modifier performs .

4. Sub-byte Operations

Sub-byte WMMA Operation provides an access Tensor Core The method of low precision function . They are considered preview functions , That is, their data structure and API May change , And may be incompatible with future versions . This function can be accessed through nvcuda::wmma::experimental Namespace get ：

namespace experimental { 
    namespace precision { 
        struct u4; // 4-bit unsigned 
        struct s4; // 4-bit signed 
        struct b1; // 1-bit 
   } 
    enum bmmaBitOp {
        bmmaBitOpXOR = 1, // compute_75 minimum
        bmmaBitOpAND = 2  // compute_80 minimum
    };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 }; 
}

about 4 Bit accuracy , Usable API remain unchanged , But you must specify experimental::precision::u4 or experimental::precision::s4 As fragment data type . Because the elements of the fragment are packed together ,num_storage_elements Will be smaller than the num_elements. Sub-byte Fragment num_elements Variable , Therefore return Sub-byte type element_type<T> The number of elements . The same is true for unit accuracy , under these circumstances , from element_type<T> To storage_element_type<T> The mapping of is as follows ：

experimental::precision::u4 -> unsigned (8 elements in 1 storage element) 
experimental::precision::s4 -> int (8 elements in 1 storage element) 
experimental::precision::b1 -> unsigned (32 elements in 1 storage element) 
T -> T  //all other types

Sub-byte The allowable layout of fragments is always matrix_a Of row_major and matrix_b Of col_major.

For sub byte operations ,load_matrix_sync in ldm Value of for element type experimental::precision::u4 and Experimental::precision::s4 Should be 32 Multiple , Or for element types experimental::precision::b1 Should be 128 Multiple （ namely , In both cases 16 Multiple of bytes ）.

bmma_sync:
Wait until all warp lane Both have been implemented. bmma_sync, And then execute warp Synchronous bit matrix multiplication and accumulation operation D = (A op B) + C, among op By logical operation bmmaBitOp and bmmaAccumulateOp Defined cumulative composition . The available operations are ：

bmmaBitOpXOR,matrix_a A line in is related to matrix_b Of 128 Ranked 128 position XOR
bmmaBitOpAND,matrix_a A line in is related to matrix_b Of 128 Ranked 128 position AND, Can be used for computing power 8.0 And later .

The cumulative operation is always bmmaAccumulateOpPOPC, It calculates the number of set bits .

5. Restrictions

For each primary and secondary device architecture ,tensor cores The special format required may be different . Because the thread only holds fragments of the entire matrix （ Opaque architecture specific ABI data structure ）, Therefore, developers are not allowed to make assumptions about how to map the parameters to the registers involved in matrix multiplication and accumulation , This makes the situation more complicated .

Because fragments are architecture specific , If the function has been compiled for different link compatible architectures and linked together to become the same device executable , Then remove them from the function A Pass to function B It's not safe . under these circumstances , The size and layout of the fragments will be specific to one architecture , In another architecture WMMA API Will lead to incorrect results or potential damage .

An example of two link compatible architectures with different fragment layouts is sm_70 and sm_75.

fragA.cu: void foo() { wmma::fragment<...> mat_a; bar(&mat_a); }
fragB.cu: void bar(wmma::fragment<...> *mat_a) { // operate on mat_a }

// sm_70 fragment layout
$> nvcc -dc -arch=compute_70 -code=sm_70 fragA.cu -o fragA.o
// sm_75 fragment layout
$> nvcc -dc -arch=compute_75 -code=sm_75 fragB.cu -o fragB.o
// Linking the two together
$> nvcc -dlink -arch=sm_75 fragA.o fragB.o -o frag.o

This undefined behavior may not be detected by tools at compile time and runtime , Therefore, extra care is needed to ensure that the layout of fragments is consistent . When built with a compatible architecture for different links and expected to pass WMMA When linking legacy libraries of fragments , This kind of link danger is most likely to occur .

Please note that , In the case of weak links （ for example ,CUDA C++ Inline function ）, The linker may choose any available function definition , This may lead to implicit passing between compilation units .

To avoid such problems , The matrix should always be stored in memory for transmission through external interfaces （ for example wmma::store_matrix_sync(dst, ...);）, It can then be safely passed as a pointer type to bar() [ for example float *dst].

Please note that , because sm_70 Can be in sm_75 Up operation , Therefore, the above example sm_75 Code changed to sm_70 And in sm_75 Correct operation on . however , When with others sm_75 When linking binary files compiled separately , It is recommended to include sm_75 Native code .

6. Element Types & Matrix Sizes

Tensor core supports multiple element types and matrix sizes . The following table shows the supported matrix_a、matrix_b and accumulator Various combinations of matrices ：

Matrix A	Matrix B	Accumulator	Matrix Size (m-n-k)
__half	__half	float	16x16x16
__half	__half	float	32x8x16
__half	__half	float	8x32x16
__half	__half	__half	16x16x16
__half	__half	__half	32x8x16
__half	__half	__half	8x32x16
unsigned char	unsigned char	int	16x16x16
unsigned char	unsigned char	int	32x8x16
unsigned char	unsigned char	int	8x32x16
signed char	signed char	int	16x16x16
signed char	signed char	int	32x8x16
signed char	signed char	int	8x32x16

Alternate floating point support ：

Matrix A	Matrix B	Accumulator	Matrix Size (m-n-k)
__nv_bfloat16	__nv_bfloat16	float	16x16x16
__nv_bfloat16	__nv_bfloat16	float	32x8x16
__nv_bfloat16	__nv_bfloat16	float	8x32x16
precision::tf32	precision::tf32	float	16x16x8

Double fine support :

Matrix A	Matrix B	Accumulator	Matrix Size (m-n-k)
double	double	double	8x8x4

Yes sub-byte Experimental support for operation ：

Matrix A	Matrix B	Accumulator	Matrix Size (m-n-k)
precision::u4	precision::u4	int	8x8x32
precision::s4	precision::s4	int	8x8x32
precision::b1	precision::b1	int	8x8x128

7. Example

The following code is in a single warp To realize 16x16x16 Matrix multiplication :

#include <mma.h>
using namespace nvcuda;
      
__global__ void wmma_ker(half *a, half *b, float *c) {
   // Declare the fragments
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

原网站

版权声明
本文为[Little Heshang sweeping the floor]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/183/202207020612427811.html