当前位置:网站首页>[Kevin's third play in a row] is rust really slower than C? Further analyze queen micro assessment
[Kevin's third play in a row] is rust really slower than C? Further analyze queen micro assessment
2022-06-27 07:19:00 【51CTO】
author :Kevin Wang
yesterday , In this article, I analyzed the unreliability of micro assessment , Left a little technical details that were not analyzed clearly . Comment area @ Wangmingzhe suggested that I could use VTune Tools , I'll continue to analyze it today .
Previous review
In yesterday's Micro assessment , I passed through queen.rs Add... At the beginning of the program NOP Instructions and queen.c Call the test function and perf analysis , It is proved that the difference is caused by the code tested in the compilation results layout( Address location of relevant code in memory ) Resulting noise .
Today, use the tools to further analyze layout What has been affected .
Or from Canada NOP Start
Write a script , Automatic separate test at queen.rs Add 1 individual 、2 individual ...N individual NOP, See if there is any regularity in the impact on the test results .
So I got the following results :

Draw it as a graph :
queen.rs Test time and added NOP The number of relations
Same method , We are queen.c Of main Add... At the beginning of the function NOP To get C Language version of the curve .
queen.c Test time and added NOP The number of relations
You can see Rust The average value of the edition 533 And variance 177 comparison C The version should be better !( Of course, this can not be used as a standard to judge language performance ,CPU The optimization of the microarchitecture layer is too complicated and subtle , Special cases cannot be used to illustrate the problem )
C and Rust The execution time of both versions is about inserting NOP Count 16 It is a half cycle and changes periodically , So this cycle 16 What the hell is it ? Looking at the assembly code, I found that the compiler will automatically align the loop body to 16byte:
Add 9 individual nop A compilation of ,Block 2 Is the beginning of the first loop body
Add 10 individual NOP A compilation of ,Block 2 Is the alignment inserted by the compiler NOP, The original Block 2 Moved back 16byte become Block 3
Get a clue : In this case , Align the beginning of the first loop body to an odd number 16byte You will get better performance .
use VTune To run the
Yesterday we analyzed and cache line Alignment independent .Intel Provides a tool VTune Used for analysis app Performance of , Than perf More accurate and detailed . Then we VTune Tools to run to see . take rust Compare the speed of the two versions of , They came out separately and got two Summary The report :
Faster reports
Reports of slower runs
Comparing the two reports, we can see , Both have high branch prediction failure rates ( There is still room for optimization ), But the difference is caused by the three items circled in red . There are two things involved : DSB and MITE.
I have a general understanding of these two things :
- intel modern CPU Will convert the program's machine instructions into finer grained microinstructions (uops), The main purpose is to realize the out of order execution of instructions ,MITE Is the engine that performs this transformation , Like a compiler .
- because MITE It takes time and effort , A little bit new CPU And the introduction of DSB To cache the conversion results , Similar to compile cache , But the cache capacity may be very small .
Then let's combine the tools to understand , Some comments are given in the tool :
DSB Switches
Metric Description
Intel microarchitecture code name Sandy Bridge introduces a new decoded ICache. This cache, called the DSB (Decoded Stream Buffer), stores uOps that have already been decoded, avoiding many of the penalties of the legacy decode pipeline, called the MITE (Micro-instruction Translation Engine). However, when control flows out of the region cached in the DSB, the front-end incurs a penalty as uOp issue switches from the DSB to the MITE. This metric measures this penalty.
Possible Issues
A significant portion of cycles is spent switching from the DSB to the MITE. This may happen if a hot code region is too large to fit into the DSB.
Tips
Consider changing code layout (for example, via profile-guided optimization) to help your hot regions fit into the DSB.
Front-End Bandwidth
Metric Description
This metric represents a fraction of slots during which CPU was stalled due to front-end bandwidth issues, such as inefficiencies in the instruction decoders or code restrictions for caching in the DSB (decoded uOps cache). In such cases, the front-end typically delivers a non-optimal amount of uOps to the back-end.
front summay I can roughly understand the meaning of the three differences in :
DSB Switches: The slow one comes from DSB The hit rate of fetch instruction is low , More are switched to MITE Now we have compiled .
Front-End Bandwidth MITE: The slow spend on MITE There is more time on ,MITE Busy .
Front-End Bandwidth DSB: The slow ones spend from DSB It takes more time to fetch instructions ( This should echo the first one ?).
In summary, it is When it's slow DSB The hit rate is low , More time is spent on MITE On .
Why is there a difference in hit rate ? because DSB The cache is a block of code , therefore , It depends on whether our hot blocks are aligned to DSB The frame of .
Summary
So the conclusion is still : This micro evaluation result is wrong , Differences are related to instruction alignment , Belong to noise , Someone compiled C fast , Someone compiled Rust fast , It's all luck to see where the compiler aligns the instructions , Can not reflect the differences between languages .
The above analysis is based on i7 9700K Conduct , Other CPU It may be different , There may be a similar mechanism . About DSB, I can't find any more details , I don't know my CPU DSB How big is the , If there is any mistake, please tap .
边栏推荐
猜你喜欢
[email protected][2389:1: columnNameTypeOrConstraint : ( ( tableConstraint ) | ( columnNameT"/>NoViableAltException([email protected][2389:1: columnNameTypeOrConstraint : ( ( tableConstraint ) | ( columnNameT

Matlab GUI interface simulation DC motor and AC motor speed simulation

2022 CISP-PTE(一)文件包含

Redis cache penetration, cache breakdown, cache avalanche

Gérer 1000 serveurs par personne? Cet outil d'automatisation o & M doit être maîtrisé

YOLOv6又快又准的目标检测框架 已开源
![[graduation season] graduation is the new beginning of your life journey. Are you ready](/img/4e/aa763455da974d9576a31568fc6625.jpg)
[graduation season] graduation is the new beginning of your life journey. Are you ready

面试官:用分库分表如何做到永不迁移数据和避免热点问题?

From 5 seconds to 1 second, the system flies

2018 mathematical modeling competition - special clothing design for high temperature operation
随机推荐
【LeetCode】Day90-二叉搜索树中第K小的元素
window右键管理
Jupiter notebook file directory
Use uview to enable tabbar to display the corresponding number of tabbars according to permissions
MPC control of aircraft wingtip acceleration and control surface
面试官:请你介绍一下缓存穿透、缓存空值、缓存雪崩、缓存击穿的,通俗易懂
【Kevin三连弹之三】Rust真的比C慢吗?进一步分析queen微测评
R 中的 RNA-Seq 数据分析 - 调查数据中的差异表达基因!
使用 Blackbox Exporter 测试网络连通性
[graduation season] graduation is the new beginning of your life journey. Are you ready
webscoket 数据库监听
extendible hashing
NoViableAltException([email protected][2389:1: columnNameTypeOrConstraint : ( ( tableConstraint ) | ( columnNameT
Partial function of Scala
boundvalueops和opsforvalue区别
win10远程连接云服务器
Xiaomi Interviewer: let's talk about the proficient Registration Center for three days and three nights
专业四第二周自测
POI export excle
Unrecognized VM option ‘‘