当前位置:网站首页>Neon Optimization: summary of performance optimization experience

Neon Optimization: summary of performance optimization experience

2022-07-07 01:14:00 To know

NEON Optimization series :

  1. NEON Optimize 1: Software performance optimization 、 How to reduce power consumption ?link
  2. NEON Optimize 2:ARM Summary of optimized high frequency instructions , link
  3. NEON Optimize 3: Matrix transpose instruction optimization case ,link
  4. NEON Optimize 4:floor/ceil Optimization case of function ,link
  5. NEON Optimize 5:log10 Optimization case of function ,link
  6. NEON Optimize 6: About cross access and reverse cross access ,link
  7. NEON Optimize 7: Performance optimization experience summary ,link
  8. NEON Optimize 8: Performance optimization FAQs QA,link

This article summarizes the commonly used NEON Optimization techniques and important references .

ll

photo from 《Practical approach to Arm Neon Optimization》, It shows various fields such as NEON Optimized excellent example code base ,Neon-enabled libraries.

Optimization techniques


High frequency common

  • Hotspot functions involve a large number of IO When reading and writing , The memory address of the data should be the same as NEON Array or system bit alignment , Such as 32 Bit alignment , It can reduce the access cost

  • run demo obtain profile, Look at the proportion of expenses Top5, It must be optimized for a version

  • Priority should be given to NEON Instruction Parallel Computing , It can greatly reduce the cost

  • for loop

    • Circular numbers are not preferred 4 The row expands ; For large circulation 8 Parallel Computing
    • Loop variable i, Try to set it to basic data type , Such as 32 position
    • Such as for In circulation if Optimize ,if Chinese vs cnt Count , use 4 Registers are stored , At the end 4 Add register values
    • When using the subtraction counter in the loop, the overhead can be reduced
      • Such as : for (b = 0; b < num; b++)
      • Change it to :for (b = 0; b < num - 3; b += 4)
      • It can be changed :for (b = num - 1; b - 3 >= 0; b -= 4)
      • It can be changed :for (b = num - 1; b >= 3; b -= 4)
      • Be sure to remember when reading and writing data , Move the coordinates forward by the corresponding position , Original read :cf[b], Today read :cf[b - 3]
  • Array index value

    • Array index and the operation involved in the index , Try to replace it with pointer offset plus or minus
    • Avoid large index jumps , Reduce cache miss, See... For specific examples : Matrix multiplication optimization , Divided into 4*4 Handle
  • Memory usage

    • Local variables are preferred , Instead of malloc Heap memory , Reduce cache miss
    • For specific variable types , Manual for Loop parallel copy value , Maybe it's better than memcpy() Functions are more efficient , because memcpy There is also a lot of judgment involved inside , To ensure platform compatibility
    • involve IO When reading and writing , Soft imitation data may not accurately reflect hard imitation results , Hard copy shall prevail , Soft copy cannot simulate the latest chip level optimization
  • Instruction operation

    • Matrix multiplication scene , Without significantly increasing register variables , External A It's also best to read more data in parallel , Follow B Column operations of , Reduce B Number of reads per column
    • Multiply and add instructions ,add and mul Can be combined into mla, One instruction completes the multiply add operation
  • Algorithmic perspective

    • Observe the algorithm of hot spot function , Optimize the time complexity from the algorithm level , Carry out equivalent implementation
      • Whether the data is orderly ,for Whether the cycle can become dichotomy
      • Whether bubble sorting can become merging
      • Whether there is redundant logic , Redundant variable calculation
  • details

    • Bad 1 problem : Pay attention to... Again !
      • It is found that parallel computing often occurs in loops :-4, +=4 Combination question , Even if there is 4 Values can only be calculated separately .
      • Should be changed to :-3,+=4; Other similar :-7, +=8; Whether it's <=、< or >= scene .
      • give an example :
        • original :for (i = 0; i < size; i++)
        • parallel :for (i = 0; i < size - 3; i += 4)
        • Finish up :for (; i < size; i ++)
      • principle : With 4 Take Lu as an example
        • Parallel writing ,-3 The purpose of is to ensure from i From the beginning, there will be 4 Two available values ,+4 The goal is to i To iterate , Every time 4 Go forward once
        • The last word , The purpose of adding finishing is to deal with size Can not be 4 The scene of division , Not enough at the end 4 The remaining values of are processed separately
    • NEON Optimize macro switches
      • Commonly used #ifndef XX_NO_NEON Macro switch to control NEON switch , Easy to fall back NEON Optimize the version debug

Compilation options

  • Initial compilation options :O0,Omemory( Optimize memory ), Open inline

  • Optimize compilation options :O2,Otime( Optimize time ), Do not open inline ; Generally open O2

  • When baseline optimization is recommended , Try to choose Do not open inlining to improve the view , In order to observe the actual overhead in the hotspot function

    • Opened inline , It's easy to see , Some big overhead functions I Inline to X Function , But just look X The function can't be optimized for a long time
    • Such as guanneilian , Then the overhead function I Is far more than X function , Become Top Hotspot overhead function , Convenient for optimization analysis
  • Try to use O3 Optimization is used O3, Some codes that cannot be used alone O2 Optimize compilation , Then link the module , Together O3 compile

  • Don't be easy restore default Compile settings , Otherwise, all configurations will fail , Such as compilation platform 、 Compilation options

matters needing attention

  • Cost calculation

    • When calculating the cost of each frame , Make sure that the duration of each frame is 10ms、20ms, Modify the corresponding frame length
    • RVDS Calculated in MCPS Essential for MIPS, Pay attention to distinguish it from the result of hard imitation

Other questions


use #ifndef XX_NO_NEON To define about NEON The macro , What's the intention , Why don't #ifdef XX_NEON

#ifndef ALG_NO_NEON
// opt_neon_code
...
#else
// origin code
...
#endif
  • ifndef The meaning of is , Because the baseline output view belongs to low-frequency operation , And check every time NEON The progress of optimization is high frequency operation , So by default NEON Optimize , When viewing the baseline, you only need to add the corresponding macro .
  • Of course , Such as habit #ifdef XX_NEON You can't help it , It doesn't affect the effect .

Related resources


Attach good resources used in the learning process , The last reference link is colored eggs , involve :Optimizing Software in C++ - Agner Fog - PDF,C++ Software performance optimization .

This book is all C++ A book that programmers should read , It does everything from the language level 、 Compiler level 、 Memory access level 、 Multithreading level 、CPU Level describes how to tune software performance , It is a classic e-book , Welcome to advanced reading !

Data link summary :

  • google keywords: neon optimization, A large number of relevant resources can be obtained
  • NEON Optimize ARM The official introduction ,link
  • NEON Optimize ARM Official introduction article translation ,link
  • ARM Official programmers Neon Programming Guide ,link
  • Optimizing C Code with Neon Intrinsics, link
  • ARM Neon Intrinsics Learning points north : From the beginning 、 Advanced to learn a thorough ,link
  • How to improve software performance with NEON, link
  • ittiam: Professional optimization materials of Indian ether company ,link1,link2
  • AI neural network CNN in im2col Calculation optimization of matrix convolution kernel ,link
  • Google development cmath.h Of NEON Optimize , Library links :link, Library description :link
  • blockbuster ! performance optimization PDF:Optimizing software in C++,An optimization guide for Windows, Linux, and Mac platforms,link
原网站

版权声明
本文为[To know]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/188/202207061722311373.html