
Performance optimization topics

2022-06-22 09:10:00 zuhan_

Recently the project team has been working through performance bottlenecks during a production rollout. The optimization tools and optimization points are summarized as follows:

System-level performance optimization usually consists of two phases: performance analysis (performance profiling) and code optimization.
The goal of performance profiling is to find performance bottlenecks and identify the causes of performance problems and the hot code.
The goal of code optimization is to tune the code or the compilation options for specific performance problems, so as to improve software performance.

Improving instruction utilization requires raising instruction-level parallelism (Instruction Level Parallelism, ILP) and reducing data dependencies and memory-access dependencies.
Function jumps, memory barriers, and branch-prediction misses all stall the pipeline and hurt instruction execution efficiency. For example, using inline functions judiciously trades storage space for fewer jump instructions and fewer stack push/pop operations, thereby improving execution efficiency.

One. Performance profiling tools:
1. The perf tool
Purpose: view each function's share of CPU time, the cache hit rate, etc.
Usage: copy the perf binary into the /sbin/ directory and run it there.
Common commands:
perf top — display each function's share of CPU time
perf stat — run a command and report statistics about it
perf stat -r 5 -e cache-misses,cache-references — view the cache-miss rate

https://wudaijun.com/2019/04/linux-perf/
https://www.cnblogs.com/arnoldlu/p/6241297.html
https://www.cnblogs.com/sunsky303/p/8328836.html
Introduction to the principles: https://blog.csdn.net/chichi123137/article/details/80139237

2. The profile facility (readprofile)
Purpose: check where function time is spent.
Command to reset the counters:
readprofile -m /proc/system.ko -p /proc/profile -r
Command to view the results:
readprofile -m /proc/system.ko -p /proc/profile -v

https://www.cnblogs.com/no7dw/archive/2012/10/17/2727692.html

3. Commands built into Linux
(1) top — view CPU usage
(2) vmstat 1 — view the number of context switches
Here cs (context switch) is the number of context switches per second. Depending on the scenario, CPU context switches can be further divided into interrupt context switches, thread context switches, and process context switches; whichever kind it is, too many context switches make the CPU spend its time saving and restoring registers, kernel stacks, and virtual-memory data, which shortens the time processes actually get to run and sharply degrades overall system performance. In vmstat output, us and sy are the user-mode and kernel-mode CPU utilization respectively; these two values are also well worth watching.
(3) ps — view process status and resource usage

4. Lmbench
A simple, portable micro-benchmark suite, conforming to the ANSI C standard and developed for UNIX/POSIX systems. Broadly speaking, it measures two key characteristics: latency and bandwidth.
http://lmbench.sourceforge.net/
http://www.bitmover.com/lmbench/
https://winddoing.github.io/post/54953.html
Two. Code optimization:

  1. Cacheline alignment
    Avoid a single datum straddling two cachelines: structures can be cacheline-aligned, and for contiguous arrays the first address can be cacheline-aligned, though this may waste some space.
    Cacheline-align statistics variables,
    for example:
    typedef struct tagAlignedu64Stat
    {
        ULONG ulNum;
    } __attribute__((aligned(64))) ALIGNED_STAT_U64_S;

    extern ALIGNED_STAT_U64_S g_ulGetCyclesSessionAdd[48];
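    A quick sanity check (a sketch, assuming ULONG is a project typedef for unsigned long): __attribute__((aligned(64))) also pads sizeof up to a full cache line, so each array element occupies its own line.

    #include <assert.h>

    typedef unsigned long ULONG;    /* assumed project typedef */

    typedef struct tagAlignedu64Stat
    {
        ULONG ulNum;
    } __attribute__((aligned(64))) ALIGNED_STAT_U64_S;

    /* aligned(64) rounds sizeof up to 64: one cache line per element */
    static_assert(sizeof(ALIGNED_STAT_U64_S) == 64, "one element per cache line");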

Understanding the cache:
http://blog.chinaunix.net/uid-28541347-id-5830692.html

ARM cache notes:
https://www.cnblogs.com/fanru5161/p/10640108.html

Cache details:
https://blog.csdn.net/yhb1047818384/article/details/79604976?spm=1001.2014.3001.5502
The L1 and L2 caches are private to each CPU core, while the L3 cache is shared by several cores; think of it as a smaller but faster layer of memory.

  2. Branch prediction
    Use macros such as likely/unlikely to keep the hot path contiguous and improve the cacheline hit probability; a sketch follows below.
    Reference link: https://www.cnblogs.com/LubinLew/p/GCC-__builtin_expect.html
    When there are multiple conditional checks, order the branches by their probability.
    When the processor finds the data it wants in the cache, that is called a cache hit; conversely, if the CPU does not find the data in the cache, that is called a cache miss.
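    A minimal sketch of likely/unlikely built on GCC's __builtin_expect (these definitions follow the well-known Linux-kernel convention; do_work is a hypothetical hot-path function):

    #include <stddef.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int do_work(const char *buf) { return buf[0]; }  /* hypothetical */

    int process(const char *buf)
    {
        if (unlikely(buf == NULL))   /* error path: rarely taken */
            return -1;
        return do_work(buf);         /* hot path falls straight through */
    }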
  3. Delayed computation
    Do not initialize variables that will not be used until later (note this may conflict with the coding standard); a sketch follows below.
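    A minimal sketch of delayed (lazy) initialization, with illustrative names: the table is built on first use rather than at startup.

    #include <stdlib.h>

    static int *g_table;                /* stays NULL until first use */

    static int *build_table(void)       /* stand-in for an expensive init */
    {
        return calloc(256, sizeof(int));
    }

    int lookup(unsigned char key)
    {
        if (g_table == NULL)
            g_table = build_table();    /* pay the cost only when needed */
        return g_table ? g_table[key] : 0;
    }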
  4. Register parameters
    Pass function parameters in registers where possible, which means keeping the number of function parameters small; see the sketch below.
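    For scale (a fact about the x86-64 System V ABI, not from the original post): the first six integer/pointer arguments travel in registers, and a seventh spills to the stack, so bundling many values behind one struct pointer keeps the call register-only. Names here are illustrative.

    struct render_args {
        int x, y, w, h, color, flags, layer;   /* 7 values behind 1 pointer */
    };

    void render(const struct render_args *args);   /* single register argument */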
  5. Keep related code adjacent
    Related code or files should be placed as close together as possible and compiled together, to improve the cache hit rate.
  6. Code redundancy
    Reduce redundant code and dead code.
  7. Read/write separation
    Two unrelated variables, one being read and one being written, sitting in the same cache line: each write then invalidates the whole cache line (false sharing); see the sketch below.
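    A minimal sketch of separating the reader's and the writer's data onto different cache lines with padding (names are illustrative):

    #define CACHE_LINE_SIZE 64

    struct counters {
        /* each field starts on its own 64-byte line, so a writer updating
         * `writes` no longer invalidates the line a reader of `reads` uses */
        unsigned long reads  __attribute__((aligned(CACHE_LINE_SIZE)));
        unsigned long writes __attribute__((aligned(CACHE_LINE_SIZE)));
    };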
  8. Data prefetching: prefetch()
    The premise of data prefetching is that the prefetched data will be used soon, which matches spatial locality; whether the prefetched data will actually be used depends on the context. In practice data prefetching is mostly used in loops, because loops are the code with the strongest spatial locality (see the loop sketch below).
    Data prefetching with __builtin_prefetch():
void __builtin_prefetch (const void *addr, ...)
 For example:
__builtin_prefetch(addr, 0, 3);
__builtin_prefetch(addr, 1, 3);

The function also takes two optional arguments, rw and locality.

rw is a compile-time constant, either 1 or 0: 1 means the address is about to be written (w), 0 means it is about to be read.

locality must also be a compile-time constant and expresses "temporal locality". Temporal locality means that if an instruction is executed, it is likely to be executed again soon, and if some data is accessed, it is likely to be accessed again soon. The value ranges from 0 to 3: 0 means no temporal locality, i.e. the data or address will not be accessed again for a long time after this access; 3 means high temporal locality, i.e. it is very likely to be accessed again soon; 1 and 2 mean low and medium temporal locality respectively. The default value is 3.
From: https://www.cnblogs.com/dongzhiquan/p/3694858.html

From: https://www.cnblogs.com/HadesBlog/p/13741551.html

https://blog.csdn.net/kongfuxionghao/article/details/47028919

https://www.cnblogs.com/pengdonglin137/p/3716889.html
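A minimal sketch of the typical loop usage (array name and prefetch distance are illustrative): prefetch a few iterations ahead so the data is already in cache by the time it is summed.

#include <stddef.h>

#define PREFETCH_DISTANCE 8    /* tune per platform; illustrative value */

long sum_array(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);  /* read, high locality */
        sum += data[i];
    }
    return sum;
}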

9. Memory coloring
A datum in memory can be placed only into the cache line whose index [getCacheLineIndex(addr)] matches its address, in some cache way. Suppose the cache-line index extracted from the address is i: the hardware probes line i of every cache way simultaneously, looks for a cache way whose line i is free, and puts the data there. If none of the m cache ways has a free line i, the eviction policy kicks in to clear one.

For example, suppose a cache has 4 cache ways (4-way set-associative) and each cache way has 16 cache lines. If a data structure's memory address maps to cache-line index 2, it can only be placed into line 2 of one of the 4 cache ways. If line 2 of every cache way is already in use, one of them must be evicted. So if the addresses of several data structures map to the same cache-line index, contention stays fierce even when the cache as a whole has plenty of free space. To avoid cache replacement, the cache-line indexes of different data structures' addresses should differ; otherwise the probability of conflict rises.

Coloring gives the same data structure in different slabs different address offsets, so their cache-line indexes are staggered and the cache is used more fully.
In application, an x86 buffer-allocation example (a sketch follows below):
/* 4 buffs per cycle: buff 1 sits 16K+64 bytes after buff 0, buff 2 sits 16K+64*2 bytes after buff 1, and buff 3 sits 16K+64*3 bytes after buff 2 */
https://blog.csdn.net/midion9/article/details/50910543
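A minimal sketch of that layout (sizes taken from the comment above; names are illustrative): one allocation carved into 4 buffers whose gaps grow by one cache line each time, so the buffers map to staggered cache-line indexes.

#include <stdlib.h>

#define BUF_SIZE  (16 * 1024)   /* 16K payload per buff */
#define LINE_SIZE 64            /* cache line size */
#define NUM_BUFS  4

static char *g_buf[NUM_BUFS];

int alloc_colored_buffers(void)
{
    /* 4 payloads plus the 64*(1+2+3) bytes of coloring gaps */
    size_t total = NUM_BUFS * BUF_SIZE + LINE_SIZE * (1 + 2 + 3);
    char *base = malloc(total);
    if (base == NULL)
        return -1;

    size_t off = 0;
    for (int k = 0; k < NUM_BUFS; k++) {
        g_buf[k] = base + off;
        off += BUF_SIZE + LINE_SIZE * (k + 1);  /* gap grows by one line */
    }
    return 0;
}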

10. Memory interleaving
Looking at DDR's access characteristics, two accesses to the same DDR device must be separated by certain time intervals, including CL (CAS latency), tRCD (RAS-to-CAS delay), tRP (precharge period), and so on.

To raise DDR access speed, multi-channel technology can be used. Typical desktop and notebook CPUs have supported dual channel for a long time, and triple channel has since been added. If data is spread across memory modules on different channels, the memory controller can ignore the delays and timings above and read them simultaneously, doubling or even tripling the speed (and more where more channels are supported). Qualcomm's first-generation ARM server SoC used 4 DDR controllers, supporting four channels.

But programs are constrained: a program does not scatter its data so widely that it lands in another DIMM; usually the program and its data sit in the same DIMM, and with the CPU cache prefetching data for you, the speedup from multiple channels is not that obvious.

So another method is needed to raise the speed: distribute the same block of memory across different channels. This technique is called interleaving. Regardless of whether the cache hits, the channels can then be accessed simultaneously, and multi-channel technology becomes far more useful.

Link: https://www.jianshu.com/p/6f8ffc43a561

11. Inline functions
To inline or not to inline, that is the question. Inlining removes function-call overhead (the push and pop operations), but it can also duplicate a lot of code and make the binary larger. Inlining also hurts debugging (the assembly no longer matches the source). So use it with care: a small function (under 10 lines) is worth trying to inline; a function that is called from many places or runs for a long time should generally not be inlined. A sketch follows below.
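A minimal sketch of a good inline candidate (names are illustrative): a tiny accessor marked static inline, so the compiler can drop the call/return and stack setup at every call site.

struct session_table { unsigned long count; };

static inline unsigned long session_count(const struct session_table *t)
{
    return t->count;   /* tiny body: call overhead would dominate it */
}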
12. Finally, there is also case-by-case debugging of variables and parameters in the actual business code; concrete problems require concrete analysis.

Reference links:
Performance optimization summary:
https://blog.csdn.net/armlinuxww/article/details/89709660

https://blog.csdn.net/weixin_39860915/article/details/103519157?spm=1001.2101.3001.6650.7&utm_medium=distribute.pc_relevant.none-task-blog-2~default~BlogCommendFromBaidu~default-7.no_search_link&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2~default~BlogCommendFromBaidu~default-7.no_search_link

Original post: https://yzsam.com/2022/173/202206220904438700.html