当前位置:网站首页>Neon Optimization: summary of performance optimization experience
Neon Optimization: summary of performance optimization experience
2022-07-07 01:14:00 【To know】
NEON Optimize : Performance optimization experience summary
NEON Optimization series :
- NEON Optimize 1: Software performance optimization 、 How to reduce power consumption ?link
- NEON Optimize 2:ARM Summary of optimized high frequency instructions , link
- NEON Optimize 3: Matrix transpose instruction optimization case ,link
- NEON Optimize 4:floor/ceil Optimization case of function ,link
- NEON Optimize 5:log10 Optimization case of function ,link
- NEON Optimize 6: About cross access and reverse cross access ,link
- NEON Optimize 7: Performance optimization experience summary ,link
- NEON Optimize 8: Performance optimization FAQs QA,link
This article summarizes the commonly used NEON Optimization techniques and important references .
photo from 《Practical approach to Arm Neon Optimization》, It shows various fields such as NEON Optimized excellent example code base ,Neon-enabled libraries.
Optimization techniques
High frequency common
Hotspot functions involve a large number of IO When reading and writing , The memory address of the data should be the same as NEON Array or system bit alignment , Such as 32 Bit alignment , It can reduce the access cost
run demo obtain profile, Look at the proportion of expenses Top5, It must be optimized for a version
Priority should be given to NEON Instruction Parallel Computing , It can greatly reduce the cost
for loop
- Circular numbers are not preferred 4 The row expands ; For large circulation 8 Parallel Computing
- Loop variable i, Try to set it to basic data type , Such as 32 position
- Such as for In circulation if Optimize ,if Chinese vs cnt Count , use 4 Registers are stored , At the end 4 Add register values
- When using the subtraction counter in the loop, the overhead can be reduced
- Such as : for (b = 0; b < num; b++)
- Change it to :for (b = 0; b < num - 3; b += 4)
- It can be changed :for (b = num - 1; b - 3 >= 0; b -= 4)
- It can be changed :for (b = num - 1; b >= 3; b -= 4)
- Be sure to remember when reading and writing data , Move the coordinates forward by the corresponding position , Original read :
cf[b]
, Today read :cf[b - 3]
Array index value
- Array index and the operation involved in the index , Try to replace it with pointer offset plus or minus
- Avoid large index jumps , Reduce cache miss, See... For specific examples : Matrix multiplication optimization , Divided into 4*4 Handle
Memory usage
- Local variables are preferred , Instead of malloc Heap memory , Reduce cache miss
- For specific variable types , Manual for Loop parallel copy value , Maybe it's better than memcpy() Functions are more efficient , because memcpy There is also a lot of judgment involved inside , To ensure platform compatibility
- involve IO When reading and writing , Soft imitation data may not accurately reflect hard imitation results , Hard copy shall prevail , Soft copy cannot simulate the latest chip level optimization
Instruction operation
- Matrix multiplication scene , Without significantly increasing register variables , External A It's also best to read more data in parallel , Follow B Column operations of , Reduce B Number of reads per column
- Multiply and add instructions ,add and mul Can be combined into mla, One instruction completes the multiply add operation
Algorithmic perspective
- Observe the algorithm of hot spot function , Optimize the time complexity from the algorithm level , Carry out equivalent implementation
- Whether the data is orderly ,for Whether the cycle can become dichotomy
- Whether bubble sorting can become merging
- Whether there is redundant logic , Redundant variable calculation
- Observe the algorithm of hot spot function , Optimize the time complexity from the algorithm level , Carry out equivalent implementation
details
- Bad 1 problem : Pay attention to... Again !
- It is found that parallel computing often occurs in loops :
-4, +=4
Combination question , Even if there is 4 Values can only be calculated separately . - Should be changed to :
-3,+=4
; Other similar :-7, +=8
; Whether it's <=、< or >= scene . - give an example :
- original :
for (i = 0; i < size; i++)
- parallel :
for (i = 0; i < size - 3; i += 4)
- Finish up :
for (; i < size; i ++)
- original :
- principle : With 4 Take Lu as an example
- Parallel writing ,-3 The purpose of is to ensure from i From the beginning, there will be 4 Two available values ,+4 The goal is to i To iterate , Every time 4 Go forward once
- The last word , The purpose of adding finishing is to deal with size Can not be 4 The scene of division , Not enough at the end 4 The remaining values of are processed separately
- It is found that parallel computing often occurs in loops :
- NEON Optimize macro switches
- Commonly used
#ifndef XX_NO_NEON
Macro switch to control NEON switch , Easy to fall back NEON Optimize the version debug
- Commonly used
- Bad 1 problem : Pay attention to... Again !
Compilation options
Initial compilation options :O0,Omemory( Optimize memory ), Open inline
Optimize compilation options :O2,Otime( Optimize time ), Do not open inline ; Generally open O2
When baseline optimization is recommended , Try to choose
Do not open inlining to improve the view
, In order to observe the actual overhead in the hotspot function- Opened inline , It's easy to see , Some big overhead functions I Inline to X Function , But just look X The function can't be optimized for a long time
- Such as guanneilian , Then the overhead function I Is far more than X function , Become Top Hotspot overhead function , Convenient for optimization analysis
Try to use O3 Optimization is used O3, Some codes that cannot be used alone O2 Optimize compilation , Then link the module , Together O3 compile
Don't be easy restore default Compile settings , Otherwise, all configurations will fail , Such as compilation platform 、 Compilation options
matters needing attention
Cost calculation
- When calculating the cost of each frame , Make sure that the duration of each frame is 10ms、20ms, Modify the corresponding frame length
- RVDS Calculated in MCPS Essential for MIPS, Pay attention to distinguish it from the result of hard imitation
Other questions
use #ifndef XX_NO_NEON
To define about NEON The macro , What's the intention , Why don't #ifdef XX_NEON
?
#ifndef ALG_NO_NEON
// opt_neon_code
...
#else
// origin code
...
#endif
- ifndef The meaning of is , Because the baseline output view belongs to low-frequency operation , And check every time NEON The progress of optimization is high frequency operation , So by default NEON Optimize , When viewing the baseline, you only need to add the corresponding macro .
- Of course , Such as habit
#ifdef XX_NEON
You can't help it , It doesn't affect the effect .
Related resources
Attach good resources used in the learning process , The last reference link is colored eggs , involve :Optimizing Software in C++ - Agner Fog - PDF,C++ Software performance optimization .
This book is all C++ A book that programmers should read , It does everything from the language level 、 Compiler level 、 Memory access level 、 Multithreading level 、CPU Level describes how to tune software performance , It is a classic e-book , Welcome to advanced reading !
Data link summary :
- google keywords: neon optimization, A large number of relevant resources can be obtained
- NEON Optimize ARM The official introduction ,link
- NEON Optimize ARM Official introduction article translation ,link
- ARM Official programmers Neon Programming Guide ,link
- Optimizing C Code with Neon Intrinsics, link
- ARM Neon Intrinsics Learning points north : From the beginning 、 Advanced to learn a thorough ,link
- How to improve software performance with NEON, link
- ittiam: Professional optimization materials of Indian ether company ,link1,link2
- AI neural network CNN in im2col Calculation optimization of matrix convolution kernel ,link
- Google development cmath.h Of NEON Optimize , Library links :link, Library description :link
- blockbuster ! performance optimization PDF:Optimizing software in C++,An optimization guide for Windows, Linux, and Mac platforms,link
边栏推荐
- Part VI, STM32 pulse width modulation (PWM) programming
- 迈动互联中标北京人寿保险,助推客户提升品牌价值
- JTAG debugging experience of arm bare board debugging
- The printf function is realized through the serial port, and the serial port data reception is realized by interrupt
- Provincial and urban level three coordinate boundary data CSV to JSON
- windows安装mysql8(5分钟)
- How to evaluate load balancing performance parameters?
- golang中的atomic,以及CAS操作
- ZABBIX 5.0: automatically monitor Alibaba cloud RDS through LLD
- 第七篇,STM32串口通信编程
猜你喜欢
Data type of pytorch tensor
让我们,从头到尾,通透网络I/O模型
深入探索编译插桩技术(四、ASM 探秘)
身体质量指数程序,入门写死的小程序项目
做微服务研发工程师的一年来的总结
【批處理DOS-CMD命令-匯總和小結】-字符串搜索、查找、篩選命令(find、findstr),Find和findstr的區別和辨析
Part IV: STM32 interrupt control programming
批量获取中国所有行政区域经边界纬度坐标(到县区级别)
Windows installation mysql8 (5 minutes)
Telerik UI 2022 R2 SP1 Retail-Not Crack
随机推荐
taro3.*中使用 dva 入门级别的哦
The printf function is realized through the serial port, and the serial port data reception is realized by interrupt
golang中的atomic,以及CAS操作
Pytorch中torch和torchvision的安装
A brief history of deep learning (I)
深入探索编译插桩技术(四、ASM 探秘)
golang中的WaitGroup实现原理
Explain in detail the matrix normalization function normalize() of OpenCV [norm or value range of the scoped matrix (normalization)], and attach norm_ Example code in the case of minmax
Taro中添加小程序 “lazyCodeLoading“: “requiredComponents“,
pyflink的安装和测试
"Exquisite store manager" youth entrepreneurship incubation camp - the first phase of Shunde market has been successfully completed!
接收用户输入,身高BMI体重指数检测小业务入门案例
【批處理DOS-CMD命令-匯總和小結】-字符串搜索、查找、篩選命令(find、findstr),Find和findstr的區別和辨析
Come on, don't spread it out. Fashion cloud secretly takes you to collect "cloud" wool, and then secretly builds a personal website to be the king of scrolls, hehe
ARM裸板调试之JTAG原理
【js】获取当前时间的前后n天或前后n个月(时分秒年月日都可)
Part IV: STM32 interrupt control programming
力扣1037. 有效的回旋镖
Activereportsjs 3.1 Chinese version | | | activereportsjs 3.1 English version
详解OpenCV的矩阵规范化函数normalize()【范围化矩阵的范数或值范围(归一化处理)】,并附NORM_MINMAX情况下的示例代码