NEON Optimization 2: Summary of High-Frequency ARM Optimization Instructions
2022-06-30 19:05:00 【To know】
Following the first article, NEON Optimization 1: Software Performance Optimization and How to Reduce Power Consumption, this post summarizes the NEON instructions (written here as C intrinsics) that came up most often in my optimization work.
The instructions fall into four groups:
- Reading and writing: memory access, and reading and writing register lanes
- Calculation: add, subtract, multiply, multiply-accumulate
- Conversion: bit width, type, reinterpretation
- Operations: shift, compare, absolute difference, maximum
Preface
Before looking at the instructions, you need to know what a vector is and what a lane (also called a vector line) is.
- A vector is a set of values held in one register at the same time. For example, float32x4_t is a vector of 4 elements.
- A lane is one individual element of a vector.
Next, to read the meaning of an instruction from its name, you need to know the naming convention.
Instruction format: int16x4_t vqmovn_s32(int32x4_t a);
- int16x4_t is the return type: int16 means a 16-bit integer, x4 means a vector of 4 elements
- In vq, v stands for vector and q marks a saturating operation, i.e. a result that overflows is clamped to the limit of the type
  - This is different from a q placed just before the underscore: in vmulq_n_f32, for example, that q means a full-width 128-bit operation
- movn is the operation type, here a narrowing bit-width conversion
- s32 means the operand elements are int32
Saturating operation:
- When a value is narrowed to a smaller type, a value above the maximum of the target type is clamped to that maximum, and a value below the minimum is clamped to that minimum
Parallel width:
- NEON on ARM processes at most 128 bits in parallel
- In an instruction name, a q before the underscore means a 128-bit operation; without it the operation is 64-bit
  - vld1q_f32 corresponds to the float32x4_t type, the full 128 bits
  - vld1_f32 corresponds to the float32x2_t type, only 64 bits
- Instructions that take a scalar operand carry _n_ in the name
  - With a scalar: float32x4_t vmulq_n_f32(float32x4_t a, float32_t b);
  - Without a scalar: float32x4_t vmulq_f32(float32x4_t a, float32x4_t b);
Reading and writing
Data access
- Load instructions:
  - Instruction 1: vld1q_f32(float *p), loads one full 128-bit vector, 32*4 bits, i.e. 4 32-bit floats
  - Instruction 2: vld2q_f32(float *p), loads 2 full 128-bit vectors, 32*4*2 bits; the result is a float32x4x2_t and the elements are de-interleaved (even-indexed elements go to the first vector, odd-indexed elements to the second)
  - Instruction 3: vld4q_f32(float *p), loads 4 full 128-bit vectors, 32*4*4 bits, de-interleaved with a stride of 4 into a float32x4x4_t
  - Purpose: load data from memory into NEON registers
- Store instructions:
  - Instruction 1: void vst1q_f32(__transfersize(4) float32_t * ptr, float32x4_t val); // stores 4 32-bit floats, 128 bits in total
  - Instruction 2: void vst1q_s16(__transfersize(8) int16_t * ptr, int16x8_t val); // stores 8 16-bit integers, 128 bits in total
  - Instruction 3: void vst1_s16(__transfersize(4) int16_t * ptr, int16x4_t val); // stores 4 16-bit integers, 64 bits in total
  - Purpose: store a NEON vector register to memory (for example into an ordinary float array)
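As a rough illustration of the load and store instructions above, the sketch below splits interleaved (x, y) pairs into two arrays with vld2q_f32 and vst1q_f32. The function name deinterleave_f32 is made up for this example, the pair count is assumed to be a multiple of 4, and a scalar fallback is included so the code also builds where NEON is not available.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split {x0, y0, x1, y1, ...} into x[] and y[].
 * `pairs` is the number of (x, y) pairs. It is assumed to be a multiple
 * of 4 to keep the sketch short; a real routine would handle the tail. */
void deinterleave_f32(const float *src, float *x, float *y, size_t pairs)
{
    assert(src != NULL && x != NULL && y != NULL);
    assert(pairs % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < pairs; i += 4) {
        /* vld2q_f32 reads 8 floats and de-interleaves them:
         * val[0] = {x0, x1, x2, x3}, val[1] = {y0, y1, y2, y3}. */
        float32x4x2_t v = vld2q_f32(src + 2 * i);
        vst1q_f32(x + i, v.val[0]);   /* store 4 floats (128 bits) */
        vst1q_f32(y + i, v.val[1]);
    }
#else
    /* Scalar fallback producing the same result on non-NEON builds. */
    for (size_t i = 0; i < pairs; ++i) {
        x[i] = src[2 * i];
        y[i] = src[2 * i + 1];
    }
#endif
}
```

With vld1q_f32 the same 8 floats would arrive in memory order, so the de-interleaving would need extra shuffles.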
Setting the lanes of a vector
- Instruction: float32x4_t vdupq_n_f32(float32_t value);
- Purpose: sets every lane to the same value, i.e. initializes a vector to one constant
- Note: other instructions, such as vsetq_lane_f32, set the value of one specific lane
Getting lanes out of a vector
- First two lanes: float32x2_t vget_low_f32(float32x4_t a); // a1, a2, a3, a4 => a1, a2
- Last two lanes: float32x2_t vget_high_f32(float32x4_t a); // a1, a2, a3, a4 => a3, a4
- Purpose: extracts half of a vector, either the first two or the last two of the 4 values
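One common use of vdupq_n_f32 together with vget_low_f32 and vget_high_f32 is reducing a vector accumulator to a single number. The sketch below sums a float array this way. The name sum_f32 is hypothetical, the length is assumed to be a multiple of 4, and vaddq_f32, vadd_f32 and vget_lane_f32 are standard arm_neon.h intrinsics that are not covered in this post.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum of n floats, n assumed to be a multiple of 4. */
float sum_f32(const float *p, size_t n)
{
    assert(p != NULL);
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);            /* all 4 lanes = 0 */
    for (size_t i = 0; i < n; i += 4)
        acc = vaddq_f32(acc, vld1q_f32(p + i));     /* 4 partial sums */
    /* Reduce 4 lanes to 2 by adding the low and high halves,
     * then add the two remaining lanes as scalars. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    return vget_lane_f32(half, 0) + vget_lane_f32(half, 1);
#else
    /* Scalar fallback for non-NEON builds. */
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
#endif
}
```

The NEON path adds the elements in a different order than the scalar loop, so for general floating-point data the two results can differ in the last bits.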
Calculation
Addition
- Instruction: int32x4_t vaddq_s32(int32x4_t a, int32x4_t b);
- Purpose: vr = a + b, lane by lane
Subtraction
- Instruction: int32x4_t vsubq_s32(int32x4_t a, int32x4_t b);
- Purpose: vr = a - b, lane by lane
Vector times scalar
- Instruction: float32x4_t vmulq_n_f32(float32x4_t a, float32_t b); // a1, a2, a3, a4;
- Purpose: every lane is multiplied by the scalar, so the output is
vr = (a1, a2, a3, a4) * b
Vector-scalar multiply-accumulate
- Instruction: float32x4_t vmlaq_n_f32(float32x4_t a, float32x4_t b, float32_t c);
- Purpose: multiply-accumulate of a vector and a scalar; the result is
vr = a + b * c
- Note in particular: it is not a * b + c; the accumulator is the first argument
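The argument order of vmlaq_n_f32 is easy to get wrong, so here is a minimal sketch of the classic y = y + x * a loop. The function name axpy_f32 is made up for illustration and the length is assumed to be a multiple of 4; the scalar branch shows the arithmetic the NEON branch is meant to match.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: y[i] = y[i] + x[i] * a, n assumed to be a multiple of 4. */
void axpy_f32(float *y, const float *x, float a, size_t n)
{
    assert(x != NULL && y != NULL);
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t vy = vld1q_f32(y + i);
        float32x4_t vx = vld1q_f32(x + i);
        /* First argument is the accumulator: vy + vx * a. */
        vy = vmlaq_n_f32(vy, vx, a);
        vst1q_f32(y + i, vy);
    }
#else
    /* Scalar fallback for non-NEON builds. */
    for (size_t i = 0; i < n; ++i)
        y[i] = y[i] + x[i] * a;
#endif
}
```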
Conversion
Bit width conversion
- Narrow to wide: int32x4_t vmovl_s16(int16x4_t a);
- Purpose: widens 4 16-bit values to 4 32-bit values, equivalent per lane to:
int16_t a = 3; int32_t b = (int32_t)a;
- Wide to narrow: int16x4_t vqmovn_s32(int32x4_t a);
- Purpose: narrows each lane; because the value may not fit in the narrower type, the saturating form is used so that out-of-range values are clamped instead of wrapping
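To make the widening and saturating narrowing concrete, the following sketch converts arrays in both directions. The function names are hypothetical, the length is assumed to be a multiple of 4, and the scalar branch spells out the clamping that vqmovn_s32 performs per lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: int16 -> int32, n assumed to be a multiple of 4. */
void widen_s16_to_s32(const int16_t *src, int32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 4)
        vst1q_s32(dst + i, vmovl_s16(vld1_s16(src + i)));  /* 64 -> 128 bits */
#else
    for (size_t i = 0; i < n; ++i)
        dst[i] = (int32_t)src[i];
#endif
}

/* Hypothetical helper: int32 -> int16 with saturation. */
void narrow_s32_to_s16(const int32_t *src, int16_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 4)
        vst1_s16(dst + i, vqmovn_s32(vld1q_s32(src + i))); /* 128 -> 64 bits */
#else
    /* Scalar fallback: clamp to [INT16_MIN, INT16_MAX] before the cast. */
    for (size_t i = 0; i < n; ++i) {
        int32_t v = src[i];
        if (v > INT16_MAX) v = INT16_MAX;
        if (v < INT16_MIN) v = INT16_MIN;
        dst[i] = (int16_t)v;
    }
#endif
}
```

A plain cast such as (int16_t)70000 would wrap around instead of clamping, which is why the saturating form is the usual choice when narrowing.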
Type conversion
- Instruction: float32x4_t vcvtq_f32_s32(int32x4_t a);
- Purpose: converts 32-bit integers to 32-bit floats; cvt is short for convert
Type reinterpretation
- Instruction: int8x16_t vreinterpretq_s8_f32(float32x4_t a); // treats the float32x4_t a as an int8x16_t
- Purpose: reinterprets a vector as another vector type
- Note: the bits themselves are not changed; the same binary data is simply read as a different type
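The difference between converting and reinterpreting shows up clearly when both are applied in sequence. In the sketch below, integers are converted to floats by value and the resulting floats are then reinterpreted as raw 32-bit patterns. The function name is invented for this example, vreinterpretq_u32_f32 is the u32 sibling of the intrinsic shown above, and the expected bit patterns assume IEEE-754 single precision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: bits[i] = raw bit pattern of (float)src[i].
 * n is assumed to be a multiple of 4. */
void int_to_float_bits(const int32_t *src, uint32_t *bits, size_t n)
{
    assert(src != NULL && bits != NULL);
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 4) {
        int32x4_t   vi = vld1q_s32(src + i);
        float32x4_t vf = vcvtq_f32_s32(vi);         /* value changes form: 1 -> 1.0f */
        uint32x4_t  vb = vreinterpretq_u32_f32(vf); /* same 128 bits, new type */
        vst1q_u32(bits + i, vb);
    }
#else
    /* Scalar fallback: cast converts the value, memcpy copies the bits. */
    for (size_t i = 0; i < n; ++i) {
        float f = (float)src[i];
        memcpy(&bits[i], &f, sizeof f);
    }
#endif
}
```

For the input 1 the conversion yields 1.0f, whose bit pattern is 0x3F800000, while reinterpreting the integer 1 directly as a float would give a tiny denormal number.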
Operations
Shift left and right
- Shift left: uint32x4_t vshlq_n_u32(uint32x4_t a, __constrange(0,31) int b); // shift left, b in the range [0, 31]
- Shift right: uint32x4_t vshrq_n_u32(uint32x4_t a, __constrange(1,32) int b); // shift right, b in the range [1, 32]
- Purpose: shifts every lane of a vector left or right by a constant; b must be a compile-time constant
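Absolute difference
- Instruction: float32x4_t vabdq_f32(float32x4_t a, float32x4_t b);
- Purpose: vr = |a - b|, lane by lane
- Note: among the instructions in this post, only multiplication (including multiply-accumulate) has a form that takes a scalar directly (_n_); others such as maximum, addition and subtraction do not, so the scalar has to be broadcast into a vector first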
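Following the note above, an absolute difference against a single scalar can be sketched by broadcasting the scalar with vdupq_n_f32 and then using vabdq_f32. The function name abs_diff_scalar_f32 is hypothetical and the length is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = |a[i] - c|, n assumed to be a multiple of 4. */
void abs_diff_scalar_f32(const float *a, float c, float *out, size_t n)
{
    assert(a != NULL && out != NULL);
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Broadcast the scalar once, outside the loop. */
    const float32x4_t vc = vdupq_n_f32(c);
    for (size_t i = 0; i < n; i += 4)
        vst1q_f32(out + i, vabdq_f32(vld1q_f32(a + i), vc));
#else
    /* Scalar fallback for non-NEON builds. */
    for (size_t i = 0; i < n; ++i) {
        float d = a[i] - c;
        out[i] = d < 0.0f ? -d : d;
    }
#endif
}
```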
Maximum
- Instruction: float32x4_t vmaxq_f32(float32x4_t a, float32x4_t b); // a1, a2, a3, a4; b1, b2, b3, b4;
- Purpose: takes the maximum lane by lane; the output is:
[max(a1, b1), max(a2, b2), ..., max(a4, b4)]
Pairwise max
- Instruction: float32x2_t vpmax_f32(float32x2_t a, float32x2_t b); // a1, a2; b1, b2;
- Purpose: takes the maximum of each pair of adjacent elements; the output is:
[max(a1, a2), max(b1, b2)]
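The two maximum instructions combine naturally: vmaxq_f32 keeps four running maxima, and vpmax_f32 folds them down to one at the end. The sketch below finds the largest element of an array this way. The name max_f32 is made up for this example, the length is assumed to be a non-zero multiple of 4, and vget_lane_f32 is a standard intrinsic not covered in this post.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: largest of n floats, n a non-zero multiple of 4. */
float max_f32(const float *p, size_t n)
{
    assert(p != NULL);
    assert(n >= 4 && n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t m = vld1q_f32(p);                    /* 4 running maxima */
    for (size_t i = 4; i < n; i += 4)
        m = vmaxq_f32(m, vld1q_f32(p + i));
    /* {max(m0, m1), max(m2, m3)} */
    float32x2_t r = vpmax_f32(vget_low_f32(m), vget_high_f32(m));
    /* Both lanes now hold the overall maximum. */
    r = vpmax_f32(r, r);
    return vget_lane_f32(r, 0);
#else
    /* Scalar fallback for non-NEON builds. */
    float m = p[0];
    for (size_t i = 1; i < n; ++i)
        if (p[i] > m)
            m = p[i];
    return m;
#endif
}
```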
Comparison
- Less than: uint32x4_t vcltq_f32(float32x4_t a, float32x4_t b); // tests a < b
- Less than or equal: uint32x4_t vcleq_s32(int32x4_t a, int32x4_t b); // tests a <= b
- Greater than or equal: uint32x4_t vcgeq_s32(int32x4_t a, int32x4_t b); // tests a >= b
- Mnemonics: clt (compare less than), cgt (compare greater than), ceq (compare equal), cge (>=), cle (<=)
- Return type: an unsigned vector whose lane width matches the inputs; each lane is all ones where the comparison is true and all zeros where it is false
Bitwise select
- Instruction: int32x4_t vbslq_s32(uint32x4_t a, int32x4_t b, int32x4_t c);
- Purpose: for every bit of a, outputs the corresponding bit of b if the bit is 1, otherwise the corresponding bit of c
- Note: usually combined with a comparison instruction, whose unsigned result is passed as a; the return type is the same as the type of b and c
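A compare followed by a bitwise select is the usual way to write a branch-free "condition ? b : c" per lane. As a small sketch, the function below replaces negative values with zero using vcgeq_s32 and vbslq_s32. The name relu_s32 is hypothetical and the length is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = src[i] >= 0 ? src[i] : 0.
 * n is assumed to be a multiple of 4. */
void relu_s32(const int32_t *src, int32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int32x4_t zero = vdupq_n_s32(0);
    for (size_t i = 0; i < n; i += 4) {
        int32x4_t  v    = vld1q_s32(src + i);
        uint32x4_t mask = vcgeq_s32(v, zero);       /* 0xFFFFFFFF where v >= 0 */
        vst1q_s32(dst + i, vbslq_s32(mask, v, zero)); /* mask ? v : 0 */
    }
#else
    /* Scalar fallback for non-NEON builds. */
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] >= 0 ? src[i] : 0;
#endif
}
```

Because each mask lane is either all ones or all zeros, selecting bit by bit gives the same result as selecting whole lanes.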
Reversing elements within a vector
- Instruction: uint8x8_t vrev16_u8(uint8x8_t vec);
- Mnemonic: vrev(bits)_(type)
- Purpose: splits the vector into groups of the given number of bits and reverses the order of the elements inside each group
- Example:
uint8x8_t src = {1,2,3,4,5,6,7,8};
dst = vrev16_u8(src); // dst = {2,1,4,3,6,5,8,7}: groups of 16 bits, 8-bit elements reversed inside each group
dst = vrev64_u8(src); // dst = {8,7,6,5,4,3,2,1}: groups of 64 bits, 8-bit elements reversed inside each group
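One practical use of vrev16_u8 is swapping the two bytes of every 16-bit value in a buffer, for instance when changing endianness. The following is a minimal sketch under that assumption. The function name swap_bytes_16 is made up, the byte count is assumed to be a multiple of 8, and vld1_u8 and vst1_u8 are the 64-bit u8 load and store counterparts of the instructions shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: swap the two bytes of each 16-bit group.
 * n is the number of bytes and is assumed to be a multiple of 8. */
void swap_bytes_16(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    assert(n % 8 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 8)
        vst1_u8(dst + i, vrev16_u8(vld1_u8(src + i)));  /* 8 bytes per step */
#else
    /* Scalar fallback: read both bytes first so src == dst also works. */
    for (size_t i = 0; i < n; i += 2) {
        uint8_t first = src[i];
        uint8_t second = src[i + 1];
        dst[i] = second;
        dst[i + 1] = first;
    }
#endif
}
```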