NEON Optimization 2: Summary of High-Frequency ARM Optimization Instructions
2022-06-30 19:05:00 【To know】
The first article, NEON Optimization 1: Software Performance Optimization and How to Reduce Power Consumption, gave the introduction. This post summarizes the NEON instructions I have used most often in optimization work.
The instructions fall into four groups:
- Reading and writing: data access and register reads and writes
- Calculation: add, subtract, multiply, and multiply-accumulate
- Conversion: bit width, type, and reinterpretation
- Operations: shift, compare, absolute value, and maximum
Preface
Before looking at the instructions, you need to know what a vector is and what a vector lane is.
- A vector is a set of values stored in one register at the same time. For example, the float32x4_t type is a vector of 4 elements.
- A vector lane is one specific element of the vector.
Next, to understand what an instruction means, you need to know the naming convention.
Instruction format: int16x4_t vqmovn_s32(int32x4_t a);
- int16x4 is the return type: int16 means a 16-bit integer, and x4 means a vector of 4 elements
- In vq, the v means a vector operation and the q means a saturating operation, for example an overflowing value is clamped to the maximum value
  - By contrast, in a name such as vmulq_n_f32, the q just before the underscore means a full-width 128-bit operation
- movn is the operation type, here a bit-width conversion (narrowing)
- s32 means the operand type is int32
Saturating operation:
- When the bit width is narrowed, a value outside the range of the target type is automatically clamped to that type's maximum (or minimum) value
Parallel width:
- NEON on the ARM platform computes in parallel on at most 128 bits
- In an instruction name, a q directly before the underscore and type suffix means a 128-bit operation; without the q it is a 64-bit operation
  - vld1q_f32 corresponds to the f32x4 type, the full 128 bits
  - vld1_f32 corresponds to the f32x2 type, only 64 bits
Scalar operands:
- Functions that involve a scalar operand carry _n_ in the name as a marker
  - With a scalar: float32x4_t vmulq_n_f32(float32x4_t a, float32_t b);
  - Without a scalar: float32x4_t vmulq_f32(float32x4_t a, float32x4_t b);
Reading and writing
Data access
- Load instructions:
  - Instruction 1: vld1q_f32(const float *p) loads one full-width 128-bit vector, 32*4, i.e. four 32-bit floats
  - Instruction 2: vld2q_f32(const float *p) loads two full-width 128-bit vectors, 32*4*2 (consecutive elements are de-interleaved across the two vectors)
  - Instruction 3: vld4q_f32(const float *p) loads four full-width 128-bit vectors, 32*4*4
  - Effect: loads data from memory into NEON registers
- Store instructions:
  - Instruction 1: void vst1q_f32(__transfersize(4) float32_t * ptr, float32x4_t val); // stores 4 32-bit floats, 128 bits in total
  - Instruction 2: void vst1q_s16(__transfersize(8) int16_t * ptr, int16x8_t val); // stores 8 16-bit integers, 128 bits in total
  - Instruction 3: void vst1_s16(__transfersize(4) int16_t * ptr, int16x4_t val); // stores 4 16-bit integers, 64 bits in total
  - Effect: stores the contents of a NEON register (a vector) to memory, for example into an ordinary float array
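The load and store instructions above are usually combined into a load, compute, store loop. The following is a minimal sketch, not code from the original post: the helper name scale_f32 is hypothetical, the NEON path is compiled only when the compiler defines __ARM_NEON, and the scalar loop handles both the leftover elements and non-ARM builds.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = src[i] * k for n floats. */
void scale_f32(float *dst, const float *src, size_t n, float k) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(src + i); /* load 4 floats (128 bits) */
        v = vmulq_n_f32(v, k);              /* multiply every lane by the scalar k */
        vst1q_f32(dst + i, v);              /* store 4 floats back to memory */
    }
#endif
    for (; i < n; ++i) {                    /* scalar tail and non-NEON fallback */
        dst[i] = src[i] * k;
    }
}
```

Processing four elements per iteration and finishing with a scalar tail keeps the function correct for any n, not only multiples of 4.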
Setting the lanes of a vector
- Instruction: float32x4_t vdupq_n_f32(float32_t value);
- Effect: sets every lane to the same value, i.e. initializes a vector to one specific value
- Note: more advanced instructions can set the value of one specific lane
Getting lanes from a vector
- First two lanes: float32x2_t vget_low_f32(float32x4_t a); // a1, a2, a3, a4 => a1, a2
- Last two lanes: float32x2_t vget_high_f32(float32x4_t a); // a1, a2, a3, a4 => a3, a4
- Effect: takes half of a vector, either the first two or the last two of the 4 values
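One common use of vdupq_n_f32 together with vget_low_f32 and vget_high_f32 is summing an array: start from an all-zero accumulator, add four lanes at a time, then fold the two halves together. This is a sketch under assumptions: sum_f32 is a hypothetical name, and vaddq_f32, vadd_f32, and vget_lane_f32 are standard NEON intrinsics that the post does not list.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns the sum of n floats. */
float sum_f32(const float *src, size_t n) {
    size_t i = 0;
    float total = 0.0f;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);            /* all four lanes start at 0 */
    for (; i + 4 <= n; i += 4) {
        acc = vaddq_f32(acc, vld1q_f32(src + i));   /* lane-wise accumulate */
    }
    /* Fold 4 lanes into 2, then into a scalar. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    total = vget_lane_f32(half, 0) + vget_lane_f32(half, 1);
#endif
    for (; i < n; ++i) {                            /* scalar tail and non-NEON fallback */
        total += src[i];
    }
    return total;
}
```

Because the vector path adds in a different order than a plain loop, results for general floating-point data can differ in the last bits from a purely scalar sum.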
Calculation
Addition
- Instruction: int32x4_t vaddq_s32(int32x4_t a, int32x4_t b);
- Effect: vr = a + b
Subtraction
- Instruction: int32x4_t vsubq_s32(int32x4_t a, int32x4_t b);
- Effect: vr = a - b
Vector times scalar
- Instruction: float32x4_t vmulq_n_f32(float32x4_t a, float32_t b); // a1, a2, a3, a4
- Effect: the output is
vr = (a1, a2, a3, a4) * b
Vector-scalar multiply-accumulate
- Instruction: float32x4_t vmlaq_n_f32(float32x4_t a, float32x4_t b, float32_t c);
- Effect: multiply-accumulate of a vector and a scalar; the result is
vr = a + b * c
- Note in particular: it is not a * b + c
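The argument order of vmlaq_n_f32 is easy to get wrong, so here is a small sketch of the classic y = y + x * a update. The helper name axpy_f32 is hypothetical; the point is that the accumulator goes in the first argument.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: y[i] = y[i] + x[i] * a for n floats. */
void axpy_f32(float *y, const float *x, size_t n, float a) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        float32x4_t vy = vld1q_f32(y + i);
        float32x4_t vx = vld1q_f32(x + i);
        vy = vmlaq_n_f32(vy, vx, a);   /* vy + vx * a: accumulator is the FIRST argument */
        vst1q_f32(y + i, vy);
    }
#endif
    for (; i < n; ++i) {               /* scalar tail and non-NEON fallback */
        y[i] = y[i] + x[i] * a;
    }
}
```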
Conversion
Bit-width conversion
- Narrow-to-wide instruction: int32x4_t vmovl_s16(int16x4_t a);
- Effect: extends 4 16-bit values to 4 32-bit values, equivalent to:
int16_t a = 3; int32_t b = (int32_t)a;
- Wide-to-narrow instruction: int16x4_t vqmovn_s32(int32x4_t a);
- Effect: narrows wide values to narrow ones; because overflow is possible, the narrowing saturates
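To make the saturating behaviour concrete, the sketch below narrows an int32 array to int16. The helper name narrow_s32_to_s16 is hypothetical; the scalar tail spells out the clamping that vqmovn_s32 performs on each lane.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = src[i] clamped to the int16 range. */
void narrow_s32_to_s16(int16_t *dst, const int32_t *src, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        int32x4_t wide = vld1q_s32(src + i);   /* 4 x int32 */
        int16x4_t narrow = vqmovn_s32(wide);   /* saturating narrow to 4 x int16 */
        vst1_s16(dst + i, narrow);             /* 64-bit store */
    }
#endif
    for (; i < n; ++i) {                       /* scalar tail and non-NEON fallback */
        int32_t v = src[i];
        if (v > INT16_MAX) v = INT16_MAX;      /* clamp instead of wrapping around */
        if (v < INT16_MIN) v = INT16_MIN;
        dst[i] = (int16_t)v;
    }
}
```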
Type conversion
- Instruction: float32x4_t vcvtq_f32_s32(int32x4_t a);
- Effect: converts 32-bit integers to 32-bit floats; cvt is short for convert
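Widening and type conversion are often chained, for example when 16-bit samples have to become floats. A minimal sketch, assuming a hypothetical helper named s16_to_f32 and using vld1_s16 (the 64-bit load counterpart of the store shown earlier):

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = (float)src[i] for n int16 values. */
void s16_to_f32(float *dst, const int16_t *src, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        int16x4_t v16 = vld1_s16(src + i);     /* 4 x int16 (64 bits) */
        int32x4_t v32 = vmovl_s16(v16);        /* widen to 4 x int32 */
        float32x4_t vf = vcvtq_f32_s32(v32);   /* convert int32 -> float32 */
        vst1q_f32(dst + i, vf);
    }
#endif
    for (; i < n; ++i) {                       /* scalar tail and non-NEON fallback */
        dst[i] = (float)src[i];
    }
}
```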
Type reinterpretation
- Instruction: int8x16_t vreinterpretq_s8_f32(float32x4_t a); // reinterprets the float32x4_t value a as the int8x16_t type
- Effect: reinterprets a vector as a different vector type
- Explanation: the bits themselves are not changed; the same binary value is simply decoded as a different type
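Reinterpretation is useful when a floating-point value has to be handled as raw bits. The sketch below computes an absolute value by clearing the sign bit. It is an illustration under assumptions: abs_f32 is a hypothetical name, and it uses vreinterpretq_u32_f32, vreinterpretq_f32_u32, vdupq_n_u32, and vandq_u32, which are standard NEON intrinsics from the same families but are not listed in the post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = |src[i]| by clearing the IEEE-754 sign bit. */
void abs_f32(float *dst, const float *src, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint32x4_t mask = vdupq_n_u32(0x7FFFFFFFu);   /* every bit except the sign */
    for (; i + 4 <= n; i += 4) {
        /* Same 128 bits, now viewed as 4 x uint32. */
        uint32x4_t bits = vreinterpretq_u32_f32(vld1q_f32(src + i));
        bits = vandq_u32(bits, mask);
        vst1q_f32(dst + i, vreinterpretq_f32_u32(bits)); /* view as floats again */
    }
#endif
    for (; i < n; ++i) {                                 /* scalar tail and fallback */
        uint32_t bits;
        memcpy(&bits, &src[i], sizeof bits);             /* scalar "reinterpret" */
        bits &= 0x7FFFFFFFu;
        memcpy(&dst[i], &bits, sizeof bits);
    }
}
```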
Operations
Left and right shifts
- Left shift: uint32x4_t vshlq_n_u32(uint32x4_t a, __constrange(0,31) int b); // shift left, b in the range [0, 31]
- Right shift: uint32x4_t vshrq_n_u32(uint32x4_t a, __constrange(1,32) int b); // shift right, b in the range [1, 32]
- Effect: shifts every lane of a vector left or right by a constant scalar
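As an illustration of the two shifts working together, the sketch below extracts bits 16 to 23 from each 32-bit word by shifting left and then right. The helper name extract_byte2_u32 is hypothetical, and vld1q_u32 and vst1q_u32 are the unsigned 32-bit counterparts of the load and store instructions above. The shift amounts must be compile-time constants.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = bits 16..23 of src[i]. */
void extract_byte2_u32(uint32_t *dst, const uint32_t *src, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        uint32x4_t v = vld1q_u32(src + i);
        v = vshlq_n_u32(v, 8);     /* push the top 8 bits out */
        v = vshrq_n_u32(v, 24);    /* bring the wanted byte down to bits 0..7 */
        vst1q_u32(dst + i, v);
    }
#endif
    for (; i < n; ++i) {           /* scalar tail and non-NEON fallback */
        dst[i] = (uint32_t)(src[i] << 8) >> 24;
    }
}
```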
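Absolute difference
- Instruction: float32x4_t vabdq_f32(float32x4_t a, float32x4_t b);
- Effect: vr = |a - b|
- Explanation: only multiplication-type instructions (such as vmulq_n_f32 and vmlaq_n_f32) can take a scalar operand directly; other operations such as maximum, addition, and subtraction have no scalar form, so the scalar has to be broadcast into a vector first, for example with vdupq_n_f32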
Maximum
- Instruction: float32x4_t vmaxq_f32(float32x4_t a, float32x4_t b); // a1, a2, a3, a4; b1, b2, b3, b4
- Effect: takes the lane-by-lane maximum; the output is:
[max(a1, b1), max(a2, b2), ..., max(a4, b4)]
Pairwise maximum
- Instruction: float32x2_t vpmax_f32(float32x2_t a, float32x2_t b); // a1, a2; b1, b2
- Effect: takes the maximum of each pair of adjacent lanes; the output is:
[max(a1, a2), max(b1, b2)]
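The two maximum instructions combine naturally into a reduction: vmaxq_f32 keeps a running lane-wise maximum, and vpmax_f32 folds the lanes at the end. A sketch under assumptions: max_f32 is a hypothetical name, the caller guarantees n >= 1, and vget_lane_f32 is a standard intrinsic not listed in the post.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns the largest of n floats (requires n >= 1). */
float max_f32(const float *src, size_t n) {
    size_t i = 0;
    float best = src[0];
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    if (n >= 4) {
        float32x4_t vbest = vld1q_f32(src);
        for (i = 4; i + 4 <= n; i += 4) {
            vbest = vmaxq_f32(vbest, vld1q_f32(src + i)); /* lane-wise running max */
        }
        /* [max(l0, l1), max(h0, h1)] */
        float32x2_t m = vpmax_f32(vget_low_f32(vbest), vget_high_f32(vbest));
        m = vpmax_f32(m, m);                              /* lane 0 = max of all 4 lanes */
        best = vget_lane_f32(m, 0);
    }
#endif
    for (; i < n; ++i) {                                  /* scalar tail and fallback */
        if (src[i] > best) best = src[i];
    }
    return best;
}
```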
Comparison
- Less than: uint32x4_t vcltq_f32(float32x4_t a, float32x4_t b); // tests a < b
- Less than or equal: uint32x4_t vcleq_s32(int32x4_t a, int32x4_t b); // tests a <= b
- Greater than or equal: uint32x4_t vcgeq_s32(int32x4_t a, int32x4_t b); // tests a >= b
- Mnemonics: clt (compare less than), cgt (compare greater than), ceq (compare equal), cge (>=), cle (<=)
- Return type: an unsigned vector whose lane width matches the input; each lane is all ones where the comparison is true and all zeros otherwise
Bitwise select
- Instruction: int32x4_t vbslq_s32(uint32x4_t a, int32x4_t b, int32x4_t c);
- Effect: examines every bit of a; where the bit is 1 the output takes the corresponding bit of b, otherwise the corresponding bit of c
- Note: usually combined with a comparison function, whose unsigned result is passed as a; the return type is the same as the type of the inputs b and c
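A comparison followed by a bitwise select is the NEON replacement for a per-element if/else. The sketch below clamps negative values to zero; the helper name relu_s32 is hypothetical, and vdupq_n_s32, vld1q_s32, and vst1q_s32 are the int32 counterparts of instructions shown earlier.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = src[i] if src[i] >= 0, else 0. */
void relu_s32(int32_t *dst, const int32_t *src, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int32x4_t zero = vdupq_n_s32(0);
    for (; i + 4 <= n; i += 4) {
        int32x4_t v = vld1q_s32(src + i);
        uint32x4_t mask = vcgeq_s32(v, zero);          /* all ones where v >= 0 */
        vst1q_s32(dst + i, vbslq_s32(mask, v, zero));  /* mask ? v : 0, bit by bit */
    }
#endif
    for (; i < n; ++i) {                               /* scalar tail and fallback */
        dst[i] = (src[i] >= 0) ? src[i] : 0;
    }
}
```

The select runs without any branch, so the cost is the same whatever the data looks like.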
Reversing elements within a vector
- Instruction: uint8x8_t vrev16_u8(uint8x8_t vec);
- Mnemonic: vrev(bit)_(type)
- Effect: within each group of the given number of bits, reverses the order of the elements
- Example:
uint8x8_t src = {1,2,3,4,5,6,7,8};
dst = vrev16_u8(src) --> dst = {2,1,4,3,6,5,8,7} // each 16 bits form one group; the 8-bit elements inside it are reversed
dst = vrev64_u8(src) --> dst = {8,7,6,5,4,3,2,1} // each 64 bits form one group; the 8-bit elements inside it are reversed
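A typical use of vrev16_u8 is swapping the byte order of 16-bit values, for example when converting endianness. A minimal sketch, assuming a hypothetical helper named swap_bytes_u16 and an even byte count; vld1_u8 and vst1_u8 are the 64-bit byte load and store.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: swaps the two bytes of every 16-bit value.
   n_bytes is assumed to be even. */
void swap_bytes_u16(uint8_t *dst, const uint8_t *src, size_t n_bytes) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 8 <= n_bytes; i += 8) {
        uint8x8_t v = vld1_u8(src + i);      /* 8 bytes = four 16-bit groups */
        vst1_u8(dst + i, vrev16_u8(v));      /* reverse the bytes in each group */
    }
#endif
    for (; i + 2 <= n_bytes; i += 2) {       /* scalar tail and non-NEON fallback */
        uint8_t first = src[i];
        uint8_t second = src[i + 1];
        dst[i] = second;
        dst[i + 1] = first;
    }
}
```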