TiFlash: Compiler-Oriented Automatic Vectorization Acceleration
2022-07-05 14:04:00 [TiDB community articles portal]
Original source: https://tidb.net/blog/1886d9cd
Author: Zhuyifan
Contents

- SIMD Introduction
- SIMD Function Dispatch Schemes
- Compiler-Oriented Optimization
SIMD Introduction
SIMD is an important means of accelerating programs. The CMU DB Group's Advanced Database Systems course devotes two dedicated lectures (vectorization-1, vectorization-2) to the application of SIMD vectorization in databases, which shows how important it is to modern database systems. This article briefly introduces some of the ways TiFlash uses the compiler to perform automatic vectorization.
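To make the idea concrete, here is a minimal sketch (not TiFlash code) of the kind of loop a compiler can vectorize automatically: at -O2/-O3, Clang and GCC will typically compile it into SIMD instructions that process several elements per iteration.

```cpp
// Minimal sketch: with optimizations enabled, this scalar loop is usually
// compiled into SIMD loads, adds, and stores. __restrict tells the compiler
// the arrays do not alias, which removes a common obstacle to vectorization.
void add_arrays(float * __restrict dst, const float * __restrict a, const float * __restrict b, int n)
{
    for (int i = 0; i < n; ++i)
    {
        dst[i] = a[i] + b[i];
    }
}
```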
The architectures TiFlash currently supports are x86-64 and AArch64, and the operating system platforms are Linux and macOS. Constrained by the platform ISA and operating system APIs, SIMD support runs into different problems in different environments.
x86-64
The x86-64 platform is conventionally divided into four levels:
- x86-64: CMOV, CMPXCHG8B, FPU, FXSR, MMX, OSFXSR, SCE, SSE, SSE2
- x86-64-v2: (close to Nehalem) CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
- x86-64-v3: (close to Haswell) AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE
- x86-64-v4: AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL
Each level adds different extensions. Currently, TiFlash targets x86-64-v2 when compiling for x86-64, while the vast majority of today's consumer and server CPUs already support x86-64-v3. Because Intel is currently pushing hybrid big/little core architectures, x86-64-v4 support is relatively chaotic, but on the server side the newer models offer varying degrees of AVX512 support. From the AWS support matrix we can see that 3rd-generation Xeon Scalable processors with AVX512 support are already in use in production environments.
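As an illustration (a sketch using the compiler's predefined feature macros, not TiFlash code), code can check at compile time which level it is being built for:

```cpp
#include <cstdio>

// Sketch: __SSE4_2__, __AVX2__ and __AVX512F__ are macros predefined by
// GCC/Clang according to the -march setting; the mapping to levels follows
// the list above (v2 implies SSE4.2, v3 implies AVX2, v4 implies AVX512F).
int main()
{
#if defined(__AVX512F__)
    std::puts("built for x86-64-v4 or higher");
#elif defined(__AVX2__)
    std::puts("built for x86-64-v3 or higher");
#elif defined(__SSE4_2__)
    std::puts("built for x86-64-v2 or higher");
#else
    std::puts("built for baseline x86-64");
#endif
}
```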
The overhead of the same extension instruction set also differs across x86-64 CPU microarchitectures. Generally speaking, the Intel Intrinsics Guide gives a quick view of the CPI information of the relevant instructions on different microarchitectures. If you want to optimize for a specific platform, you can read the platform-related Tuning Guides and Performance Analysis Papers, INTEL ADVANCED VECTOR EXTENSIONS, and the Intel 64 and IA-32 Architectures Software Developer Manuals (Software Optimization Reference Manual series) for Intel's official advice.
How should one choose among SSE, AVX/AVX2, and AVX512? In fact, newer technology and wider registers do not necessarily mean better results. For example, in section 2.8 of INTEL ADVANCED VECTOR EXTENSIONS we can see that mixing legacy SSE and AVX instructions leads to the so-called SSE-AVX Transition Penalty:

On the other hand, AVX2 and AVX512 come with the corresponding Frequency Scaling problem. Cloudflare's article On the dangers of Intel's frequency scaling, as well as Gathering Intel on Intel AVX-512 Transitions, analyze this problem. Briefly, AVX-512 improves performance in dense computation: the CPU frequency drops, but vectorization itself speeds things up by far more. However, if AVX512 and ordinary instructions are mixed in non-dense scenarios, one can imagine the overall performance loss caused by the frequency reduction.
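As a sketch of the transition-penalty issue (not TiFlash code, compile with -mavx): when hand-written 256-bit AVX intrinsics are followed by legacy SSE code, inserting _mm256_zeroupper() clears the upper halves of the YMM registers and avoids the penalty. Compilers emit vzeroupper automatically in most situations, so this matters mainly around manual intrinsics.

```cpp
#include <immintrin.h>
#include <cstddef>

void scale(float * data, float factor, size_t n)
{
    const __m256 vf = _mm256_set1_ps(factor);
    size_t i = 0;
    // 256-bit AVX part
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), vf));
    // Clear the upper YMM halves before any legacy-SSE-encoded code runs.
    _mm256_zeroupper();
    // 128-bit SSE tail
    const __m128 vf128 = _mm_set1_ps(factor);
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), vf128));
    for (; i < n; ++i)
        data[i] *= factor;
}
```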
On Intel platforms, the SIMD instruction sets operate on the XMM, YMM, and ZMM register files. We can use gdb's disassemble command to view the results of vectorization:
```bash
#!/usr/bin/env bash
args=(-batch -ex "file $1")
while IFS= read -r line; do
    args+=("-ex" "disassemble '$line'")
done < <(nm --demangle $1 | grep $2 | cut -d ' ' -f3-)
gdb "${args[@]}" | c++filt
# usage: bash ./this-script.sh tiflash xxx
```
AArch64
In the Arm world there is likewise the problem of uneven vectorization instruction set support across platforms. Armv8 has so far been refined into eight versions:

In terms of SIMD, AArch64 mainly has two or three instruction sets: ASIMD, SVE, and SVE2. ASIMD is already widely used; in fact, GCC/Clang enable ASIMD support by default. In Armv8, SVE is generally not implemented in the A profile, but rather in CPUs oriented towards HPC and the like. In Armv9, SVE and SVE2 have become standard extension instruction sets.
ASIMD describes fixed-width vector operations, acting on 64-bit and 128-bit registers, and is functionally close to the SSE family. SVE instead uses variable-length vectors: a vendor can provide registers up to 2048 bits wide. Using a per-lane predication scheme, the SVE instruction set establishes a programming model that does not need to know the actual register width.
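A minimal sketch of that model using the ACLE intrinsics (an illustration, not TiFlash code; requires arm_sve.h and compiling with SVE enabled, e.g. -march=armv8-a+sve): the loop never hard-codes the vector width; svcntw() reports the number of 32-bit lanes at run time, and the predicate from svwhilelt_b32 masks off the tail.

```cpp
#include <arm_sve.h>
#include <cstdint>

void add_arrays_sve(float * dst, const float * a, const float * b, int64_t n)
{
    // svcntw() = number of 32-bit lanes in an SVE vector on this CPU.
    for (int64_t i = 0; i < n; i += svcntw())
    {
        svbool_t pg = svwhilelt_b32(i, n); // per-lane predicate: active while i + lane < n
        svfloat32_t va = svld1(pg, a + i);
        svfloat32_t vb = svld1(pg, b + i);
        svst1(pg, dst + i, svadd_x(pg, va, vb));
    }
}
```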

In practice, AWS C7g (based on AWS Graviton3) has begun to support the SVE instruction set, at widths up to 256 bits. ASIMD, meanwhile, has good implementations on CPUs such as Kunpeng and on AWS Graviton2 instances.
On AArch64, the registers associated with ASIMD are commonly q0-q31; they sometimes also appear in assembly in the suffixed forms v0-v31. SVE uses z0-z31, and so on.

SIMD Function Dispatch Schemes
TiFlash's CD pipeline generates a unified binary package for each OS/arch combination for release, so the overall compilation target is a relatively generic architecture. SIMD instruction sets, however, differ from platform to platform, so we need some scheme for dispatching to the vectorized functions. Two categories of schemes are given below: runtime dispatch and load-time dispatch. On the whole, you can choose according to the following criteria:

- If you want to support non-Linux targets, the operation itself is known to be relatively time-consuming, and you do not care about one or two extra branches, you can use runtime dispatch. In this case, TiFlash can also provide a runtime switch for the corresponding vectorization scheme, which makes the feature more controllable.
- If the operation is invoked extremely often and branches may affect performance, prefer load-time dispatch. TiFlash is essentially always run on Linux in production, so on macOS it is fine to provide only the default version of the function.
Runtime Dispatch
This scheme is relatively simple. In common/detect_features.h, TiFlash provides utilities for checking for specific CPU features. We can write a runtime check and use it to decide which concrete implementation to enter. This scheme is suitable when the vectorized operation is known to take long enough that the dispatch cost is negligible in comparison.
Consider the following code:
```cpp
__attribute__((target("avx512f"))) void test4096_avx512(bool * __restrict a, const int * __restrict b)
{
    for (int i = 0; i < 4096; ++i)
    {
        a[i] = b[i] > 0;
    }
}

__attribute__((target("avx2"))) void test4096_avx2(bool * __restrict a, const int * __restrict b)
{
    for (int i = 0; i < 4096; ++i)
    {
        a[i] = b[i] > 0;
    }
}

__attribute__((noinline)) void test4096_generic(bool * __restrict a, const int * __restrict b)
{
    for (int i = 0; i < 4096; ++i)
    {
        a[i] = b[i] > 0;
    }
}

void test4096(bool * __restrict a, const int * __restrict b)
{
    if (common::cpu_feature_flags.avx512f)
    {
        return test4096_avx512(a, b);
    }
    if (common::cpu_feature_flags.avx2)
    {
        return test4096_avx2(a, b);
    }
    return test4096_generic(a, b);
}
```

As you can see, the entry function performs the feature detection and calls the implementation for the detected platform:

Each concrete function is compiled with the vectorization optimizations of its target platform.


In fact, for dispatching an identical function body like this, TiFlash already provides a packaged macro; the code above can be written as:
```cpp
#include <Common/TargetSpecific.h>

TIFLASH_DECLARE_MULTITARGET_FUNCTION(
    /* return type */ void,
    /* function name */ test4096,
    /* argument names */ (a, b),
    /* argument list */ (bool * __restrict a, const int * __restrict b),
    /* body */ {
        for (int i = 0; i < 4096; ++i)
        {
            a[i] = b[i] > 0;
        }
    })
```

IFUNC Dispatch
Observe glibc's symbol table on Linux:

We can see that some performance-critical functions are marked with the symbol type i. This means these functions are indirect functions: the program can provide multiple implementations of a function, and during the load/link phase the dynamic loader decides which implementation the target symbol is actually bound to. Glibc uses exactly this scheme to select the implementations of key functions such as memcpy/memcmp/memset.
test4096 can be rewritten as:
```cpp
void test4096(bool * __restrict a, const int * __restrict b) __attribute__((ifunc("test4096_resolver")));

extern "C" void * test4096_resolver()
{
    if (__builtin_cpu_supports("avx512f"))
        return reinterpret_cast<void *>(&test4096_avx512);
    if (__builtin_cpu_supports("avx2"))
        return reinterpret_cast<void *>(&test4096_avx2);
    return reinterpret_cast<void *>(&test4096_generic);
}
```

This scheme removes the runtime dispatch overhead, but it comes with some limitations:
- It only applies to GNU/Linux targets.
- The ifunc resolver must live in the current translation unit. If the resolver is a C++ function, you need to supply its mangled name.
The resolver runs before the C runtime and the C++ runtime are initialized, so TiFlash's own detection functions cannot be used there. On x86_64 you can use __builtin_cpu_supports; on aarch64 you can use the following scheme:

```cpp
#include <sys/auxv.h>

#ifndef HWCAP2_SVE2
#define HWCAP2_SVE2 (1 << 1)
#endif
#ifndef HWCAP_SVE
#define HWCAP_SVE (1 << 22)
#endif
#ifndef AT_HWCAP2
#define AT_HWCAP2 26
#endif
#ifndef AT_HWCAP
#define AT_HWCAP 16
#endif

namespace detail
{
static inline bool sve2_supported()
{
    auto hwcaps = getauxval(AT_HWCAP2);
    return (hwcaps & HWCAP2_SVE2) != 0;
}
static inline bool sve_supported()
{
    auto hwcaps = getauxval(AT_HWCAP);
    return (hwcaps & HWCAP_SVE) != 0;
}
} // namespace detail
```

Another interesting example: if you need to read environment variables inside a resolver, you may need to initialize the environ pointer manually:
```cpp
#include <stdlib.h>
#include <string.h>

int f1();
int f2();

extern char ** environ;
extern char ** _dl_argv;

// At resolver time environ is not yet set up; recover it from the initial
// process stack layout (argc, argv[], NULL, envp[]) via glibc's _dl_argv.
char ** get_environ()
{
    int argc = *(int *)(_dl_argv - 1);
    char ** my_environ = (char **)(_dl_argv + argc + 1);
    return my_environ;
}

typeof(f1) * resolve_f()
{
    environ = get_environ();
    const char * var = getenv("TOTO");
    if (var && strcmp(var, "ok") == 0)
    {
        return f2;
    }
    return f1;
}

int f() __attribute__((ifunc("resolve_f")));
```
Function Multiversioning Dispatch
On x86-64, Clang/GCC actually provide a more convenient way to generate IFUNCs:
```cpp
#include <iostream>

__attribute__((target("avx512f"))) void test4096(bool * __restrict a, const int * __restrict b)
{
    std::cout << "using avx512" << std::endl;
    for (int i = 0; i < 4096; ++i)
    {
        a[i] = b[i] > 0;
    }
}

__attribute__((target("avx2"))) void test4096(bool * __restrict a, const int * __restrict b)
{
    std::cout << "using avx2" << std::endl;
    for (int i = 0; i < 4096; ++i)
    {
        a[i] = b[i] > 0;
    }
}

__attribute__((target("default"))) void test4096(bool * __restrict a, const int * __restrict b)
{
    std::cout << "using default" << std::endl;
    for (int i = 0; i < 4096; ++i)
    {
        a[i] = b[i] > 0;
    }
}

int main()
{
    bool results[4096];
    int data[4096];
    for (auto & i : data)
    {
        std::cin >> i;
    }
    test4096(results, data);
    for (const auto & i : results)
    {
        std::cout << i << std::endl;
    }
}
```

Here we no longer need to distinguish function names or provide a resolver; we simply mark the different targets, and the compiler automatically generates the ifunc implementation.
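Relatedly, GCC (and newer Clang) also support the target_clones attribute, which generates all the versions plus the ifunc from a single function body; a minimal sketch for comparison (not the scheme TiFlash uses):

```cpp
// One body, three compiled versions; the compiler emits the resolver itself.
__attribute__((target_clones("avx512f", "avx2", "default")))
void test4096_clones(bool * __restrict a, const int * __restrict b)
{
    for (int i = 0; i < 4096; ++i)
    {
        a[i] = b[i] > 0;
    }
}
```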

Macro Integration
You can use code along the following lines to unify the IFUNC-based schemes for x86-64 and aarch64:
```cpp
#ifdef __linux__
#include <sys/auxv.h>
#ifndef HWCAP2_SVE2
#define HWCAP2_SVE2 (1 << 1)
#endif
#ifndef HWCAP_SVE
#define HWCAP_SVE (1 << 22)
#endif
#ifndef AT_HWCAP2
#define AT_HWCAP2 26
#endif
#ifndef AT_HWCAP
#define AT_HWCAP 16
#endif
namespace detail
{
static inline bool sve2_supported()
{
    auto hwcaps = getauxval(AT_HWCAP2);
    return (hwcaps & HWCAP2_SVE2) != 0;
}
static inline bool sve_supported()
{
    auto hwcaps = getauxval(AT_HWCAP);
    return (hwcaps & HWCAP_SVE) != 0;
}
} // namespace detail
#endif

#define TMV_STRINGIFY_IMPL(X) #X
#define TMV_STRINGIFY(X) TMV_STRINGIFY_IMPL(X)

#define TIFLASH_MULTIVERSIONED_VECTORIZATION_X86_64(RETURN, NAME, ARG_LIST, ARG_NAMES, BODY) \
    struct NAME##TiFlashMultiVersion \
    { \
        __attribute__((always_inline)) static inline RETURN inlined_implementation ARG_LIST BODY; \
        __attribute__((target("default"))) static RETURN dispatched_implementation ARG_LIST \
        { \
            return inlined_implementation ARG_NAMES; \
        }; \
        __attribute__((target("avx"))) static RETURN dispatched_implementation ARG_LIST \
        { \
            return inlined_implementation ARG_NAMES; \
        }; \
        __attribute__((target("avx2"))) static RETURN dispatched_implementation ARG_LIST \
        { \
            return inlined_implementation ARG_NAMES; \
        }; \
        __attribute__((target("avx512f,avx512vl,avx512bw,avx512cd"))) static RETURN dispatched_implementation ARG_LIST \
        { \
            return inlined_implementation ARG_NAMES; \
        }; \
        __attribute__((always_inline)) static inline RETURN invoke ARG_LIST \
        { \
            return dispatched_implementation ARG_NAMES; \
        }; \
    };

#define TIFLASH_MULTIVERSIONED_VECTORIZATION_AARCH64(RETURN, NAME, ARG_LIST, ARG_NAMES, BODY) \
    struct NAME##TiFlashMultiVersion \
    { \
        __attribute__((always_inline)) static inline RETURN inlined_implementation ARG_LIST BODY; \
        static RETURN generic_implementation ARG_LIST \
        { \
            return inlined_implementation ARG_NAMES; \
        }; \
        __attribute__((target("sve"))) static RETURN sve_implementation ARG_LIST \
        { \
            return inlined_implementation ARG_NAMES; \
        }; \
        __attribute__((target("sve2"))) static RETURN sve2_implementation ARG_LIST \
        { \
            return inlined_implementation ARG_NAMES; \
        }; \
        static RETURN dispatched_implementation ARG_LIST \
            __attribute__((ifunc(TMV_STRINGIFY(__tiflash_mvec_##NAME##_resolver)))); \
        __attribute__((always_inline)) static inline RETURN invoke ARG_LIST \
        { \
            return dispatched_implementation ARG_NAMES; \
        }; \
    }; \
    extern "C" void * __tiflash_mvec_##NAME##_resolver() \
    { \
        /* check SVE2 first: SVE2 implies SVE, so the SVE branch would otherwise shadow it */ \
        if (::detail::sve2_supported()) \
        { \
            return reinterpret_cast<void *>(&NAME##TiFlashMultiVersion::sve2_implementation); \
        } \
        if (::detail::sve_supported()) \
        { \
            return reinterpret_cast<void *>(&NAME##TiFlashMultiVersion::sve_implementation); \
        } \
        return reinterpret_cast<void *>(&NAME##TiFlashMultiVersion::generic_implementation); \
    }

#if defined(__linux__) && defined(__aarch64__)
#define TIFLASH_MULTIVERSIONED_VECTORIZATION TIFLASH_MULTIVERSIONED_VECTORIZATION_AARCH64
#elif defined(__linux__) && defined(__x86_64__)
#define TIFLASH_MULTIVERSIONED_VECTORIZATION TIFLASH_MULTIVERSIONED_VECTORIZATION_X86_64
#else
#define TIFLASH_MULTIVERSIONED_VECTORIZATION(RETURN, NAME, ARG_LIST, ARG_NAMES, BODY) \
    struct NAME##TiFlashMultiVersion \
    { \
        __attribute__((always_inline)) static inline RETURN invoke ARG_LIST BODY; \
    };
#endif

TIFLASH_MULTIVERSIONED_VECTORIZATION(
    int, sum, (const int * __restrict a, int size), (a, size), {
        int sum = 0;
        for (int i = 0; i < size; ++i)
        {
            sum += a[i];
        }
        return sum;
    })
```
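Assuming the macro expands as sketched above, the unified entry point is the generated struct's invoke member; a call site then looks like this (my_array and my_size are placeholder names):

```cpp
// my_array and my_size are placeholders for caller-provided data.
int sum_all(const int * my_array, int my_size)
{
    return sumTiFlashMultiVersion::invoke(my_array, my_size);
}
```

Compiler-Oriented Optimization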
LLVM provides a good guide to automatic vectorization: Auto-Vectorization in LLVM - LLVM 15.0.0git documentation.
You can consult its sections to learn which common patterns can be vectorized. Simply put, when looking at a loop we can ask: can unnecessary control flow be simplified? Can opaque function calls be reduced? Beyond that, for simple functions that are called a great many times in succession, consider defining them in a header so that the compiler can see and inline them, which in turn enlarges the scope for vectorization.
As Knuth said, premature optimization is the root of all evil. There is no need to rewrite non-performance-critical loops into vectorization-friendly forms just for the sake of vectorization. Using a profiler to decide which functions to optimize further is the better choice.
Checking Vectorization Status
We can use the following flags to inspect the vectorization process:
- -Rpass-missed='.*vectorize.*' reports why the compiler failed to vectorize
- -Rpass='.*vectorize.*' reports which loops the compiler vectorized
Specifically, in TiFlash, we first extract the compile command of an object file:

```bash
cat compile_commands.json | grep "/VersionFilterBlockInputStream.cpp"
```

Then add -Rpass-missed='.*vectorize.*' or -Rpass='.*vectorize.*' to that compile command to see the information.
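For a standalone experiment, a file like the following (example.cpp is a hypothetical name) can be compiled with clang++ -O2 -c -Rpass='.*vectorize.*' example.cpp; clang then prints a remark for the loop, including the vectorization width and interleave count it chose.

```cpp
// Compiled with -O2 -Rpass='.*vectorize.*', clang reports a "vectorized loop"
// remark for this function; with -Rpass-missed it reports reasons for failure.
void saxpy(float * __restrict y, const float * __restrict x, float a, int n)
{
    for (int i = 0; i < n; ++i)
    {
        y[i] = a * x[i] + y[i];
    }
}
```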


Loop Unrolling Pragmas
The following pragmas can be used to control the loop unrolling strategy and assist vectorization:
```cpp
void test1(int * a, int * b, int * c)
{
#pragma clang loop unroll(full)
    for (int i = 0; i < 1024; ++i)
    {
        c[i] = a[i] + b[i];
    }
}

void test2(int * a, int * b, int * c)
{
#pragma clang loop unroll(enable)
    for (int i = 0; i < 1024; ++i)
    {
        c[i] = a[i] + b[i];
    }
}

void test3(int * a, int * b, int * c)
{
#pragma clang loop unroll(disable)
    for (int i = 0; i < 1024; ++i)
    {
        c[i] = a[i] + b[i];
    }
}

void test4(int * a, int * b, int * c)
{
#pragma clang loop unroll_count(2)
    for (int i = 0; i < 1024; ++i)
    {
        c[i] = a[i] + b[i];
    }
}
```

Vectorization Pragmas
The following pragma suggests that clang vectorize the loop:
```cpp
static constexpr int N = 4096;
int A[N];
int B[N];

struct H
{
    double a[4];
    H operator*(const H & that)
    {
        return {
            a[0] * that.a[0],
            a[1] * that.a[1],
            a[2] * that.a[2],
            a[3] * that.a[3],
        };
    }
};

H C[N];
H D[N];
H E[N];

void test1()
{
#pragma clang loop vectorize(enable)
    for (int i = 0; i < N; i++)
    {
        C[i] = D[i] * E[i];
    }
}

void test2()
{
    for (int i = 0; i < N; i++)
    {
        C[i] = D[i] * E[i];
    }
}
```
In fact, on AArch64, TiFlash's getDelta is not vectorized by default, but with this hint it is.
If you want to use these pragmas inside a macro, they can be rewritten in the form _Pragma("clang loop vectorize(enable)").
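For example, a minimal sketch (the macro name TIFLASH_VECTORIZE_LOOP here is hypothetical):

```cpp
// _Pragma lets a preprocessor macro expand to a pragma directive.
#define TIFLASH_VECTORIZE_LOOP _Pragma("clang loop vectorize(enable)")

void scale_all(float * __restrict dst, const float * __restrict src, int n)
{
    TIFLASH_VECTORIZE_LOOP
    for (int i = 0; i < n; ++i)
    {
        dst[i] = src[i] * 2.0f;
    }
}
```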
Loop Splitting
Reusing the example above:
```cpp
void x()
{
#pragma clang loop vectorize(enable)
    for (int i = 0; i < N; i++)
    {
        A[i + 1] = A[i] + B[i];
        C[i] = D[i] * E[i];
    }
}

void y()
{
    for (int i = 0; i < N; i++)
    {
        A[i + 1] = A[i] + B[i];
    }
#pragma clang loop vectorize(enable)
    for (int i = 0; i < N; i++)
    {
        C[i] = D[i] * E[i];
    }
}
```

Here function x is not vectorized, because there is a data dependency through A. After y splits it into two loops, the latter loop can be vectorized. In practice, if the scalar computation of C[i] = D[i] * E[i] is relatively time-consuming, doing this makes sense.
In theory, #pragma clang loop distribute(enable) can handle this situation automatically, but even with this pragma, clang is still relatively conservative.
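For reference, a sketch of what that would look like on the example above (whether clang actually distributes the loop depends on its cost model):

```cpp
void x_distributed()
{
    // Asks clang's loop distribution pass to split the dependent and the
    // independent statements into separate loops so the latter can vectorize.
#pragma clang loop distribute(enable)
    for (int i = 0; i < N; i++)
    {
        A[i + 1] = A[i] + B[i];
        C[i] = D[i] * E[i];
    }
}
```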
Controlling the Vectorization Strategy
Adjusting the Vector Width
```cpp
void test(char * x, char * y, char * z)
{
#pragma clang loop vectorize_width(8)
    for (int i = 0; i < 4096; i++)
    {
        x[i] = y[i] * z[i];
    }
}
```

For example, on AArch64, vectorize_width(1) means no vectorization, vectorize_width(8) means using 64-bit registers, and vectorize_width(16) means using 128-bit registers.
Besides this, you can also use vectorize_width(fixed) and vectorize_width(scalable) to steer the choice between fixed-length and variable-length vectors.
Adjusting the Interleave Count
You can use interleave_count(4) to suggest to the compiler how many iterations to interleave when the loop is vectorized. Within limits, increasing this batch size helps the processor exploit superscalar and out-of-order execution.
```cpp
void test(char * x, char * y, char * z)
{
#pragma clang loop vectorize_width(8) interleave_count(4)
    for (int i = 0; i < 4096; i++)
    {
        x[i] = y[i] * z[i];
    }
}
```

Extracting Fixed-Length Loop Chunks
The following function is used to find the first visible row in the database storage:
```cpp
#include <cstddef>
#include <cstdint>

const uint64_t * filterRow(const uint64_t * data, size_t length, uint64_t current_version)
{
    for (size_t i = 0; i < length; ++i)
    {
        if (data[i] > current_version)
        {
            return data + i;
        }
    }
    return nullptr;
}
```

It cannot be vectorized, because the loop body contains control flow that jumps out of the loop.
In such cases, you can manually extract a fixed-length inner loop to help the compiler auto-vectorize:
```cpp
const uint64_t * filterRow(const uint64_t * data, size_t length, uint64_t current_version)
{
    size_t i = 0;
    for (; i + 64 < length; i += 64)
    {
        uint64_t mask = 0;
#pragma clang loop vectorize(enable)
        for (size_t j = 0; j < 64; ++j)
        {
            mask |= data[i + j] > current_version ? (1ull << j) : 0;
        }
        if (mask)
        {
            return data + i + __builtin_ctzll(mask);
        }
    }
    for (; i < length; ++i)
    {
        if (data[i] > current_version)
        {
            return data + i;
        }
    }
    return nullptr;
}
```

(__builtin_ctzll is a compiler builtin that counts the number of trailing zero bits of an integer; it can generally be translated efficiently into a single instruction.)