Performance optimization topics
2022-06-22 09:10:00 【zuhan_ twenty million two hundred and ten thousand three hundre】
Recently, the project team has been tackling performance bottlenecks during deployment. The optimization tools and optimization points are summarized as follows:
System-level performance optimization usually consists of two phases: performance profiling and code optimization.
The goal of performance profiling is to find performance bottlenecks and identify the causes of performance problems and the hot code.
The goal of code optimization is to tune the code or compilation options for the specific problems found, to improve software performance.
Improving instruction utilization requires increasing instruction-level parallelism (Instruction Level Parallelism, ILP) and reducing data dependencies and memory-access dependencies.
Function jumps, memory barriers, and branch-prediction misses all stall the pipeline and hurt instruction throughput. For example, judicious use of inline functions trades some storage space for fewer jump instructions and fewer stack push/pop operations, and thus better execution efficiency.
I. Performance profiling tools:
1. The perf tool
Purpose: view each function's share of CPU time, cache hit rate, etc.
Usage: copy the perf binary into /sbin/ and run it from there.
Common commands:
perf top — display each function's share of CPU time
perf stat — run a command and report performance-counter statistics
perf stat -r 5 -e cache-misses,cache-references — view the cache-miss rate
https://wudaijun.com/2019/04/linux-perf/
https://www.cnblogs.com/arnoldlu/p/6241297.html
https://www.cnblogs.com/sunsky303/p/8328836.html
How it works: https://blog.csdn.net/chichi123137/article/details/80139237
2. The profile tool
Purpose: check each function's time consumption.
Command to clear the counters:
readprofile -m /proc/system.ko -p /proc/profile -r
Command to view the results:
readprofile -m /proc/system.ko -p /proc/profile -v
https://www.cnblogs.com/no7dw/archive/2012/10/17/2727692.html
3. Built-in Linux commands
(1) top — view CPU usage
(2) vmstat 1 — view the number of context switches
In the output, cs (context switch) is the number of context switches per second. Depending on the scenario, CPU context switches can be divided into interrupt context switches, thread context switches, and process context switches; in any case, too many context switches make the CPU spend its time saving and restoring registers, kernel stacks, and virtual-memory state, which shortens the time processes actually run and sharply degrades overall system performance. The us and sy fields of vmstat's output give the user-mode and kernel-mode CPU utilization respectively; these two values are also well worth watching.
(3) ps — view process status and resource usage
4. Lmbench
A simple, portable micro-benchmark suite written to the ANSI C standard for UNIX/POSIX systems. Broadly speaking, it measures two key characteristics: latency and bandwidth.
http://lmbench.sourceforge.net/
http://www.bitmover.com/lmbench/
https://winddoing.github.io/post/54953.html
II. Code optimization:
- Cacheline alignment
Avoid data reads that straddle two cache lines. Structures can be aligned to a cache line, and for contiguous arrays the first address can be cacheline-aligned, though this may waste memory.
Align statistics variables to a cache line, for example:
extern ALIGNED_STAT_U64_S g_ulGetCyclesSessionAdd[48];
typedef struct tagAlignedu64Stat
{
    ULONG ulNum;
} __attribute__((aligned(64))) ALIGNED_STAT_U64_S;
Understanding caches:
http://blog.chinaunix.net/uid-28541347-id-5830692.html
ARM cache notes:
https://www.cnblogs.com/fanru5161/p/10640108.html
Cache details:
https://blog.csdn.net/yhb1047818384/article/details/79604976?spm=1001.2014.3001.5502


Each CPU core has its own L1 and L2 cache, while the L3 cache is shared by several cores. Think of a cache as a smaller but much faster memory.
- Branch prediction
Macros such as likely/unlikely can be used so the compiler lays out the common path fall-through, improving the hit rate of the instruction cache lines.
Reference link: https://www.cnblogs.com/LubinLew/p/GCC-__builtin_expect.html
When there are multiple conditional branches, order them by probability.
When the processor finds the data it wants in the cache, that is a cache hit; conversely, if the CPU does not find the data in the cache, that is a cache miss.
- Lazy evaluation
Do not initialize variables until they are actually needed (this may conflict with some coding standards).
- Register parameters
Prefer passing function arguments in registers, i.e. keep the number of function parameters small.
- Keep related code adjacent
Related code or files should be placed next to each other and compiled together, to improve cache hits.
- Remove redundant code
Reduce redundant code and dead code.
- Read/write separation
If two unrelated variables, one that is read and one that is written, sit in the same cache line, every write invalidates the line for the reader.
- Data prefetching prefetch()
Prefetching assumes the prefetched data will be used soon, which relies on spatial locality; whether it actually will be used depends on the context. In practice, data prefetching is mostly used in loops, because loops are the most spatially local code.
Data prefetching: __builtin_prefetch()
void __builtin_prefetch (const void *addr, ...)
For example:
__builtin_prefetch(addr, 0, 3);
__builtin_prefetch(addr, 1, 3);
The function takes two optional arguments, rw and locality.
rw is a compile-time constant, either 1 or 0: 1 means the access will be a write, 0 a read.
locality must also be a compile-time constant and describes temporal locality: if an instruction is executed, it is likely to be executed again soon; if some data is accessed, it is likely to be accessed again soon. The value ranges from 0 to 3. 0 means no temporal locality: the data or address will not be accessed again for a long time after this access. 3 means high temporal locality: it is very likely to be accessed again soon. 1 and 2 mean low and medium temporal locality respectively. The default is 3.
From: https://www.cnblogs.com/dongzhiquan/p/3694858.html
From: https://www.cnblogs.com/HadesBlog/p/13741551.html
https://blog.csdn.net/kongfuxionghao/article/details/47028919
https://www.cnblogs.com/pengdonglin137/p/3716889.html
9. Memory coloring
A datum in memory can only be placed into the cache line whose index [getCacheLineIndex(addr)] matches its address, in any of the cache ways. Suppose the cache-line index i has been extracted from the address; the hardware then looks at line i of every cache way simultaneously, searching for a way whose line i is free, and places the data there. If none of the m cache ways has line i free, the eviction policy kicks in to free one.
For example, suppose a cache has 4 cache ways (4-way set-associative), each with 16 cache lines. If a data structure's address maps to cache-line index 2, it can only go into line 2 of one of the 4 ways. If line 2 of every way is already in use, one of them must be evicted. So if several data structures' addresses map to the same cache-line index, the contention is fierce even when the rest of the cache has plenty of room. To avoid cache replacement, different data structures' addresses should map to different cache-line indexes; otherwise the probability of conflict rises.
Coloring offsets the addresses of the same data structure in different slabs, so their cache-line indexes are staggered and the cache is used more evenly.
A practical x86 buffer-allocation example:
/* every 4 buffs form a cycle: buff 1 is 16K+64 bytes after buff 0, buff 2 is 16K+64*2 bytes after buff 1, buff 3 is 16K+64*3 bytes after buff 2 */
https://blog.csdn.net/midion9/article/details/50910543
10. Memory interleaving
In terms of DDR access characteristics, two accesses to the same DDR device must be separated by certain intervals, including CL (CAS latency), tRCD (RAS-to-CAS delay), tRP (row precharge time), and so on.
To improve DDR access speed, multiple channels can be used. Typical desktop and notebook CPUs have long supported dual channel, and triple channel has since appeared. If data is spread across memory modules on different channels, the memory controller can ignore the above delays and timings and read them simultaneously, doubling or even tripling the speed (more, if more channels are supported). Qualcomm's first-generation ARM server SoC used 4 DDR controllers, supporting four channels.
However, programs rarely scatter their data so widely that it lands in a different DIMM; usually a program and its data sit in the same DIMM, and the CPU cache prefetches data on its own, so the multi-channel speedup is not that pronounced.
This calls for another way to improve speed: distributing the same block of memory across different channels, a technique called interleaving. Regardless of whether the cache hits, the channels can be accessed simultaneously, which makes multi-channel technology far more useful.
Link: https://www.jianshu.com/p/6f8ffc43a561
11. Inline functions
To inline or not to inline, that is the question. Inlining removes function-call overhead (push and pop operations), but it can also duplicate a lot of code and enlarge the binary. Inlining also hurts debugging (the assembly no longer matches the source). So use it carefully: small functions (under about 10 lines) are good inline candidates; functions that are called from many places or run for a long time generally should not be inlined.
12. Finally, there is tuning of variables and parameters specific to the actual business code; analyze each specific problem concretely.
Reference links:
Performance optimization summary:
https://blog.csdn.net/armlinuxww/article/details/89709660