NEON Optimization 2: Summary of High-Frequency ARM Optimization Instructions

2022-06-30 19:05:00 【To know】

The first article, NEON Optimization 1: Software Performance Optimization and How to Reduce Power Consumption, gave the introduction. This post shares the most frequently used NEON instructions, summarized from optimization experience.

The instructions fall into four groups:

Reading and writing: data access and register reads and writes
Calculation: addition, subtraction, multiplication, and multiply-accumulate
Conversion: bit width, type, and reinterpretation
Operations: shift, comparison, absolute difference, and maximum

Preface

Before looking at the instructions, you need to know two terms: vector and lane.

Vector: a set of values held in one register at the same time. For example, float32x4_t is a vector of 4 elements.
Lane: one specific element of a vector.

Next, to read an instruction name, you need to know the naming convention.

Instruction format: int16x4_t vqmovn_s32(int32x4_t a);
- int16x4 is the return type: int16 means a 16-bit integer, and x4 means a vector of 4 elements
- In vq, v stands for a vector operation and q for a saturating operation, i.e. a result that overflows is clamped to the limit of the type
  - In a name such as vmulq_n_f32, the q just before the underscore instead marks a full-width 128-bit operation
- movn is the operation: a narrowing move, i.e. a bit-width conversion
- s32 means the operand elements are int32
Saturating operations:
- When elements are narrowed, a value outside the range of the target type is clamped to that type's maximum or minimum
Parallel width:
- NEON on ARM processes at most 128 bits in parallel
- In an instruction name, a q right before the underscore marks a 128-bit operation; without it the operation is 64-bit
  - vld1q_f32 works on float32x4_t, the full 128 bits
  - vld1_f32 works on float32x2_t, only 64 bits
Operations that take a scalar have _n_ in the name:
- With a scalar: float32x4_t vmulq_n_f32(float32x4_t a, float32_t b);
- Without a scalar: float32x4_t vmulq_f32(float32x4_t a, float32x4_t b);

Reading and writing

Data access

Load instructions:
- Instruction 1: vld1q_f32(const float32_t *p) loads one full 128-bit vector, i.e. 32*4 bits, four 32-bit floats
- Instruction 2: vld2q_f32(const float32_t *p) loads 2 full 128-bit vectors, 32*4*2 bits, de-interleaving the elements between them
- Instruction 3: vld4q_f32(const float32_t *p) loads 4 full 128-bit vectors, 32*4*4 bits, de-interleaving the elements among them
- Effect: read data from memory into NEON registers
Store instructions:
- Instruction 1: void vst1q_f32(__transfersize(4) float32_t * ptr, float32x4_t val); // stores four 32-bit floats, 128 bits in total
- Instruction 2: void vst1q_s16(__transfersize(8) int16_t * ptr, int16x8_t val); // stores eight 16-bit integers, 128 bits in total
- Instruction 3: void vst1_s16(__transfersize(4) int16_t * ptr, int16x4_t val); // stores four 16-bit integers, 64 bits in total
- Effect: write a NEON vector register back to memory, for example into an ordinary float array
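As a rough illustration of the load and store instructions above, the following sketch splits interleaved pairs into two arrays with vld2q_f32 and vst1q_f32. The helper name deinterleave_f32 and the USE_NEON macro are made up for this example, and the plain C tail loop is an assumption added so the code also builds on hosts without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: src holds n pairs (a0, b0, a1, b1, ...). */
void deinterleave_f32(const float *src, float *a, float *b, size_t n) {
    size_t i = 0;
#if USE_NEON
    for (; i + 4 <= n; i += 4) {
        /* Loads 8 floats: val[0] gets the a's, val[1] gets the b's. */
        float32x4x2_t v = vld2q_f32(src + 2 * i);
        vst1q_f32(a + i, v.val[0]);
        vst1q_f32(b + i, v.val[1]);
    }
#endif
    /* Scalar loop for the remaining pairs (or all of them without NEON). */
    for (; i < n; ++i) {
        a[i] = src[2 * i];
        b[i] = src[2 * i + 1];
    }
}
```

Processing four pairs per iteration and finishing with a scalar loop is a common pattern when the length is not a multiple of the vector width.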

Setting the lanes of a vector

Instruction: float32x4_t vdupq_n_f32(float32_t value);
Effect: sets every lane to the same value, i.e. initializes a vector with one repeated value
Note: more advanced instructions can set the value of one specific lane

Getting lanes from a vector

Low half: float32x2_t vget_low_f32(float32x4_t a); // a1, a2, a3, a4 => a1, a2
High half: float32x2_t vget_high_f32(float32x4_t a); // a1, a2, a3, a4 => a3, a4
Effect: extracts half of a vector, either the first two or the last two of its 4 values
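One place where vdupq_n_f32, vget_low_f32, and vget_high_f32 tend to appear together is a reduction. The sketch below sums a float array; sum_f32 and USE_NEON are hypothetical names, vaddq_f32, vadd_f32, vpadd_f32, and vget_lane_f32 are intrinsics not listed in this post, and the scalar loop is an assumption added for hosts without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: returns x[0] + ... + x[n-1]. */
float sum_f32(const float *x, size_t n) {
    float total = 0.0f;
    size_t i = 0;
#if USE_NEON
    float32x4_t acc = vdupq_n_f32(0.0f);          /* four running sums, all 0 */
    for (; i + 4 <= n; i += 4)
        acc = vaddq_f32(acc, vld1q_f32(x + i));
    /* Fold 4 lanes to 2, then add the remaining pair. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    total = vget_lane_f32(vpadd_f32(half, half), 0);
#endif
    for (; i < n; ++i)
        total += x[i];
    return total;
}
```

Note that the vector path adds the elements in a different order than a plain loop, so results can differ slightly for values that are not exactly representable.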

Calculation

Addition

Instruction: int32x4_t vaddq_s32(int32x4_t a, int32x4_t b);
Effect: vr = a + b

Subtraction

Instruction: int32x4_t vsubq_s32(int32x4_t a, int32x4_t b);
Effect: vr = a - b

Vector times scalar

Instruction: float32x4_t vmulq_n_f32(float32x4_t a, float32_t b); // a1, a2, a3, a4;
Effect: vr = (a1, a2, a3, a4) * b

Vector-by-scalar multiply-accumulate

Instruction: float32x4_t vmlaq_n_f32(float32x4_t a, float32x4_t b, float32_t c);
Effect: multiplies a vector by a scalar and accumulates, vr = a + b * c
Note in particular: the result is not a * b + c
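The operand order of vmlaq_n_f32 is easy to get wrong, so here is a minimal sketch that computes y = y + x * c over an array. The name axpy_f32 and the USE_NEON macro are hypothetical, and the scalar loop is an assumption added for hosts without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: y[i] = y[i] + x[i] * c. */
void axpy_f32(float *y, const float *x, float c, size_t n) {
    size_t i = 0;
#if USE_NEON
    for (; i + 4 <= n; i += 4) {
        float32x4_t acc = vld1q_f32(y + i);           /* the accumulator comes first */
        acc = vmlaq_n_f32(acc, vld1q_f32(x + i), c);  /* acc + x * c */
        vst1q_f32(y + i, acc);
    }
#endif
    for (; i < n; ++i)
        y[i] = y[i] + x[i] * c;
}
```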

Conversion

Bit width conversion

Narrow to wide: int32x4_t vmovl_s16(int16x4_t a);
Effect: widens four 16-bit values to four 32-bit values, the equivalent of: int16_t a = 3; int32_t b = (int32_t)a;
Wide to narrow: int16x4_t vqmovn_s32(int32x4_t a);
Effect: narrows wide elements to narrow ones; because the value may overflow, the operation saturates
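To make the saturating behaviour concrete, the sketch below narrows an int32 array to int16 with vqmovn_s32. The helper name narrow_sat_s32_to_s16 and the USE_NEON macro are invented for this example; the scalar loop, added as an assumption for hosts without NEON, spells out the same clamping by hand.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: dst[i] = src[i] clamped to [INT16_MIN, INT16_MAX]. */
void narrow_sat_s32_to_s16(int16_t *dst, const int32_t *src, size_t n) {
    size_t i = 0;
#if USE_NEON
    for (; i + 4 <= n; i += 4)
        vst1_s16(dst + i, vqmovn_s32(vld1q_s32(src + i)));
#endif
    for (; i < n; ++i) {
        int32_t v = src[i];
        if (v > INT16_MAX) v = INT16_MAX;
        if (v < INT16_MIN) v = INT16_MIN;
        dst[i] = (int16_t)v;
    }
}
```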

Type conversion

Instruction: float32x4_t vcvtq_f32_s32(int32x4_t a);
Effect: converts 32-bit integers to 32-bit floats; cvt is short for convert
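A short sketch of the conversion applied to a whole array follows. The name s32_to_f32 and the USE_NEON macro are hypothetical, and the scalar loop is an assumption added for hosts without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: dst[i] = (float)src[i]. */
void s32_to_f32(float *dst, const int32_t *src, size_t n) {
    size_t i = 0;
#if USE_NEON
    for (; i + 4 <= n; i += 4)
        vst1q_f32(dst + i, vcvtq_f32_s32(vld1q_s32(src + i)));
#endif
    for (; i < n; ++i)
        dst[i] = (float)src[i];
}
```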

Type reinterpretation

Instruction: int8x16_t vreinterpretq_s8_f32(float32x4_t a); // reinterprets a, a float32x4_t, as int8x16_t
Effect: reinterprets a vector as a different vector type
Note: the bits themselves do not change; the same binary data is simply read as a different type
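Reinterpretation is mostly useful for bit tricks. As one possible illustration, the sketch below computes absolute values by clearing the sign bit of each float. It uses vreinterpretq_u32_f32 and vreinterpretq_f32_u32 rather than the s8 variant shown above, plus vandq_u32 and vdupq_n_u32; abs_f32 and USE_NEON are hypothetical names, and the scalar loop is an assumption added for hosts without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: x[i] = |x[i]| by masking off the IEEE 754 sign bit. */
void abs_f32(float *x, size_t n) {
    size_t i = 0;
#if USE_NEON
    const uint32x4_t mask = vdupq_n_u32(0x7fffffffu);
    for (; i + 4 <= n; i += 4) {
        /* Same 128 bits, now viewed as four uint32 lanes. */
        uint32x4_t bits = vreinterpretq_u32_f32(vld1q_f32(x + i));
        vst1q_f32(x + i, vreinterpretq_f32_u32(vandq_u32(bits, mask)));
    }
#endif
    for (; i < n; ++i) {
        uint32_t bits;
        memcpy(&bits, &x[i], sizeof bits); /* scalar equivalent of reinterpret */
        bits &= 0x7fffffffu;
        memcpy(&x[i], &bits, sizeof bits);
    }
}
```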

Operations

Left and right shifts

Left shift: uint32x4_t vshlq_n_u32(uint32x4_t a, __constrange(0,31) int b); // shift left, b in [0, 31]
Right shift: uint32x4_t vshrq_n_u32(uint32x4_t a, __constrange(1,32) int b); // shift right, b in [1, 32]
Effect: shifts every lane of a vector left or right by a compile-time constant
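As a small sketch of both shifts working together, the function below extracts bits 8 to 23 of each 32-bit word by shifting left by 8 and then right by 16. The name extract_bits_8_23 and the USE_NEON macro are hypothetical, and the scalar loop is an assumption added for hosts without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: dst[i] = bits 8..23 of src[i]. */
void extract_bits_8_23(uint32_t *dst, const uint32_t *src, size_t n) {
    size_t i = 0;
#if USE_NEON
    for (; i + 4 <= n; i += 4) {
        uint32x4_t v = vld1q_u32(src + i);
        v = vshlq_n_u32(v, 8);   /* drop the top 8 bits */
        v = vshrq_n_u32(v, 16);  /* move the field down to bit 0 */
        vst1q_u32(dst + i, v);
    }
#endif
    for (; i < n; ++i)
        dst[i] = (uint32_t)(src[i] << 8) >> 16;
}
```

The shift amounts must be constants known at compile time, which is why they are written as literals here.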

Absolute difference

Instruction: float32x4_t vabdq_f32(float32x4_t a, float32x4_t b);
Effect: vr = |a - b|
Note: of the arithmetic operations here, only multiplication (including multiply-accumulate) has a form that takes a scalar directly; others, such as maximum, addition, and subtraction, do not

Maximum

Instruction: float32x4_t vmaxq_f32(float32x4_t a, float32x4_t b); // a1, a2, a3, a4; b1, b2, b3, b4;
Effect: takes the lane-wise maximum; the output is [max(a1, b1), max(a2, b2), ..., max(a4, b4)]

Pairwise maximum

Instruction: float32x2_t vpmax_f32(float32x2_t a, float32x2_t b); // a1, a2; b1, b2;
Effect: takes the maximum of each pair of adjacent elements; the output is [max(a1, a2), max(b1, b2)]
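The two maximum instructions combine naturally when reducing an array to a single value. In the sketch below, vmaxq_f32 keeps four running maxima and vpmax_f32 folds them at the end. The name max_f32 and the USE_NEON macro are hypothetical, the function assumes n is at least 1, and the scalar loop is an assumption added for hosts without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: largest element of x; requires n >= 1. */
float max_f32(const float *x, size_t n) {
    float best = x[0];
    size_t i = 0;
#if USE_NEON
    float32x4_t acc = vdupq_n_f32(x[0]);           /* four running maxima */
    for (; i + 4 <= n; i += 4)
        acc = vmaxq_f32(acc, vld1q_f32(x + i));
    /* [max(a1, a2), max(a3, a4)], then the max of that pair in lane 0. */
    float32x2_t m = vpmax_f32(vget_low_f32(acc), vget_high_f32(acc));
    best = vget_lane_f32(vpmax_f32(m, m), 0);
#endif
    for (; i < n; ++i)
        if (x[i] > best) best = x[i];
    return best;
}
```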

Comparison

Less than: uint32x4_t vcltq_f32(float32x4_t a, float32x4_t b); // tests a < b
Less than or equal: uint32x4_t vcleq_s32(int32x4_t a, int32x4_t b); // tests a <= b
Greater than or equal: uint32x4_t vcgeq_s32(int32x4_t a, int32x4_t b); // tests a >= b
Mnemonics: clt (compare less than), cgt (compare greater than), ceq (compare equal), cge (>=), cle (<=)
Return type: an unsigned vector whose elements have the same bit width as the inputs; a lane is all ones where the comparison holds and all zeros otherwise

Bitwise select

Instruction: int32x4_t vbslq_s32(uint32x4_t a, int32x4_t b, int32x4_t c);
Usage: examines every bit of a; where the bit is 1 the output takes the corresponding bit of b, otherwise the corresponding bit of c
Note: a is usually the output of a comparison instruction, i.e. an unsigned comparison result; the return type is the same as the type of the inputs b and c
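A comparison followed by a bitwise select is the usual branch-free replacement for an if/else. The sketch below zeroes every element smaller than a threshold; zero_below and USE_NEON are hypothetical names, vdupq_n_s32 is an intrinsic not listed in this post, and the scalar loop is an assumption added for hosts without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: x[i] = (x[i] >= t) ? x[i] : 0. */
void zero_below(int32_t *x, int32_t t, size_t n) {
    size_t i = 0;
#if USE_NEON
    const int32x4_t thr = vdupq_n_s32(t);
    const int32x4_t zero = vdupq_n_s32(0);
    for (; i + 4 <= n; i += 4) {
        int32x4_t v = vld1q_s32(x + i);
        uint32x4_t keep = vcgeq_s32(v, thr);          /* all ones where v >= t */
        vst1q_s32(x + i, vbslq_s32(keep, v, zero));   /* v where kept, else 0 */
    }
#endif
    for (; i < n; ++i)
        if (x[i] < t) x[i] = 0;
}
```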

Reversing elements in a vector

Instruction: uint8x8_t vrev16_u8(uint8x8_t vec);
Naming pattern: vrev(bit)_(type)
Effect: splits the vector into groups of the given number of bits and reverses the order of the elements inside each group

Example:

uint8x8_t src = {1, 2, 3, 4, 5, 6, 7, 8};
uint8x8_t dst16 = vrev16_u8(src); // dst16 = {2, 1, 4, 3, 6, 5, 8, 7}: each 16-bit group has its two 8-bit elements reversed
uint8x8_t dst64 = vrev64_u8(src); // dst64 = {8, 7, 6, 5, 4, 3, 2, 1}: the single 64-bit group has its eight 8-bit elements reversed
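A typical use of vrev16_u8 is swapping the byte order of 16-bit values. The sketch below does this in place on a byte buffer; swap_bytes_u16 and USE_NEON are hypothetical names, the length is assumed to be an even number of bytes, and the scalar loop is an assumption added for hosts without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0 /* assumption: no NEON, only the scalar loop runs */
#endif

/* Hypothetical helper: swaps the two bytes of every 16-bit value in buf.
   n is the length in bytes and is assumed to be even. */
void swap_bytes_u16(uint8_t *buf, size_t n) {
    size_t i = 0;
#if USE_NEON
    for (; i + 8 <= n; i += 8)
        vst1_u8(buf + i, vrev16_u8(vld1_u8(buf + i)));
#endif
    for (; i + 2 <= n; i += 2) {
        uint8_t tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```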