
[Kevin's Triple Feature, Part 3] Is Rust Really Slower Than C? A Deeper Analysis of the queen Micro-benchmark

2022-06-27 07:19:00 51CTO


Author: Kevin Wang


In yesterday's article I analyzed why this micro-benchmark is unreliable, but left one technical detail unexamined. In the comments, @Wangmingzhe suggested I try the VTune tool, so today I continue the analysis with it.

Recap of the Previous Post

In yesterday's micro-benchmark, by adding NOP instructions at the start of queen.rs, wrapping the queen.c benchmark in a function call, and analyzing with perf, I showed that the measured difference was noise caused by the layout (the memory addresses) of the benchmarked code in the compiled binary.

Today, let's use tools to dig into exactly what that layout affects.

Starting Again with Added NOPs

I wrote a script that automatically tests queen.rs with 1, 2, ..., N NOPs added, to see whether the effect on the results follows any pattern.

This produced the following results:

[Figure: table of queen.rs timings for each NOP count]

Plotted as a graph:

[Figure: queen.rs run time vs. number of inserted NOPs]

Using the same method, we add NOPs at the start of the main function in queen.c to obtain the curve for the C version.

[Figure: queen.c run time vs. number of inserted NOPs]

You can see that the Rust version's mean (533) and variance (177) actually compare favorably with the C version's! (Of course, this cannot be used as a standard for judging language performance; the CPU's microarchitectural optimizations are far too complex and subtle for a special case to prove anything.)

The execution time of both the C and Rust versions varies periodically with the number of inserted NOPs, with a half-period of 16. So where does this 16 come from? Looking at the assembly, I found that the compiler automatically aligns loop bodies to 16 bytes:

[Figure: assembly with 9 NOPs added; Block 2 is the start of the first loop body]
[Figure: assembly with 10 NOPs added; Block 2 is now the alignment NOPs inserted by the compiler, and the original Block 2 has moved back 16 bytes to become Block 3]

This gives us a clue: in this case, aligning the start of the first loop body to an odd multiple of 16 bytes yields better performance.

Running It Under VTune

Yesterday we established that the difference is unrelated to cache-line alignment. Intel provides a tool, VTune, for analyzing application performance; it is more precise and detailed than perf. So let's run the benchmark under VTune. Comparing the faster and slower Rust builds, we get two Summary reports:

[Figure: VTune Summary report of the faster run]
[Figure: VTune Summary report of the slower run]


Comparing the two reports, both show a high branch-misprediction rate (so there is still room for optimization), but the difference comes from the three items circled in red. They involve two things: the DSB and the MITE.

Here is my rough understanding of the two:

  • Modern Intel CPUs translate the program's machine instructions into finer-grained micro-operations (uops), mainly to enable out-of-order execution. The MITE is the engine that performs this translation, a bit like a compiler.
  • Because MITE decoding is costly, somewhat newer CPUs introduced the DSB to cache the decoded results, similar to a compilation cache, but its capacity may be quite small.

Now let's check this against the tool itself. VTune provides the following notes:

DSB Switches
Metric Description
Intel microarchitecture code name Sandy Bridge introduces a new decoded ICache. This cache, called the DSB (Decoded Stream Buffer), stores uOps that have already been decoded, avoiding many of the penalties of the legacy decode pipeline, called the MITE (Micro-instruction Translation Engine). However, when control flows out of the region cached in the DSB, the front-end incurs a penalty as uOp issue switches from the DSB to the MITE. This metric measures this penalty.
Possible Issues
A significant portion of cycles is spent switching from the DSB to the MITE. This may happen if a hot code region is too large to fit into the DSB.
Tips
Consider changing code layout (for example, via profile-guided optimization) to help your hot regions fit into the DSB.
Front-End Bandwidth
Metric Description
This metric represents a fraction of slots during which CPU was stalled due to front-end bandwidth issues, such as inefficiencies in the instruction decoders or code restrictions for caching in the DSB (decoded uOps cache). In such cases, the front-end typically delivers a non-optimal amount of uOps to the back-end.

From the Summary reports, I can roughly interpret the three differing items:

DSB Switches: the slow run has a lower hit rate when fetching uops from the DSB, so it switches over to the MITE to decode more often.

Front-End Bandwidth MITE: the slow run spends more time in the MITE; the MITE is busier.

Front-End Bandwidth DSB: the slow run spends more time fetching uops from the DSB (this should echo the first item?).

In summary: in the slow run the DSB hit rate is low, and more time is spent in the MITE.

Why the difference in hit rate? Because the DSB caches blocks of code, it depends on whether our hot code blocks happen to line up with the DSB's frames.

Summary

So the conclusion stands: this micro-benchmark result is wrong. The differences are related to instruction alignment and are noise. For some people the C build comes out faster, for others the Rust build does; it all comes down to where the compiler happens to align the instructions, and it does not reflect any difference between the languages.

The above analysis was done on an i7 9700K; other CPUs may behave differently, though similar mechanisms likely exist. I could not find further details about the DSB, and I don't know how large my CPU's DSB is. Corrections are welcome.


Original article: https://yzsam.com/2022/178/202206270632017668.html

Copyright notice: this article was created by [51CTO]; please include a link to the original when reposting.