Recommended open source tools: MegPeak, a high-performance computing tool
2022-07-30 14:29:00 【PaperWeekly】

With demand for compute exploding, getting the most out of existing hardware has become very important. Put plainly, we need to optimize existing algorithms as far as possible for a specific processor, so that they can meet the heavy compute demands of current AI algorithms.
To push performance optimization to its limit, we can:
Optimize the algorithm, so that it needs as little memory access and computation as possible while still meeting the accuracy requirement.
Optimize the program, so that the code implementing the algorithm extracts as much of the processor's performance as possible.
When optimizing a program, the first questions to answer are: what fraction of the processor's compute capability does the program actually reach, how much room is left for further optimization, and in which direction should that optimization go?
To understand our processors better, the MegEngine team at Megvii developed MegPeak, a tool that helps developers evaluate performance and guides development. It is now open source. The project page has more on how MegPeak is used and how it works.
GitHub project:
https://github.com/MegEngine/MegPeak
MegPeak features
With MegPeak, users can measure the following on a target processor:
Peak throughput of an instruction
Instruction latency
Memory bandwidth
Throughput of arbitrary instruction combinations
Some of this information can be derived from the chip's data sheet combined with theoretical calculation, but in many cases detailed performance documentation for the target processor is not available. Measuring with MegPeak is also more direct and accurate, and it can measure the throughput of specific assembly instructions. The MegEngine team ran MegPeak on several commonly used ARM CPUs and compiled the results for the fmla instruction into the table below.

In the table, GFLOPS measures the device's compute capability, while FLOPS/cycle helps infer the CPU's hardware characteristics. A55, A77, and Apple M1 are taken as examples below.
A55: each lane of an fmla instruction performs two floating-point operations (one multiply and one add), and the measured FLOPS/cycle is close to 8. The A55 backend therefore presumably has either one 128-bit floating-point vector multiply-add unit or two 64-bit ones.
A77: its FLOPS/cycle is about 16, so the A77 can execute 2 SIMD fmla instructions per cycle. Its backend therefore has two SIMD fmla execution units and is at least dual-issue.
Apple M1: its FLOPS/cycle reaches 32, which implies 4 SIMD execution units.
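The inference above can be written out as a small calculation. This is a minimal sketch that assumes a 128-bit fmla on fp32 data (4 lanes, each doing one multiply and one add); the function name is illustrative and not part of MegPeak.

```python
# FLOPs produced by one 128-bit fp32 fmla: 4 lanes x (1 multiply + 1 add).
# Assumption: every measurement below used 128-bit fp32 fmla instructions.
FLOPS_PER_FMLA = 4 * 2


def estimated_fmla_per_cycle(flops_per_cycle: float) -> int:
    """Estimate how many 128-bit fmla instructions complete per cycle."""
    return round(flops_per_cycle / FLOPS_PER_FMLA)


# FLOPS/cycle values quoted in the text above.
measured = {"A55": 8, "A77": 16, "Apple M1": 32}
units = {cpu: estimated_fmla_per_cycle(v) for cpu, v in measured.items()}
print(units)
```

Rounding absorbs the small gap between a measured value such as 7.8 and the ideal 8.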
What can the data measured with MegPeak be used for?
MegPeak can measure a processor's memory bandwidth, the theoretical compute peak of an instruction, instruction latency, and similar details. This helps us:
Draw a Roofline model to guide performance optimization of our models
Assess how much room a program has for optimization
Explore the theoretical compute peak of combinations of assembly instructions
MegPeak can also be used to validate theory. We can compute a theoretical compute peak as processor frequency × instructions issued per cycle on a single core × computation performed per instruction, and then verify it against MegPeak's actual measurement.
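The product described above is easy to write down. The sketch below uses made-up numbers (a 2.0 GHz core issuing two fmla instructions per cycle, and a measured value of 30.4 GFLOPS); none of them come from the article, and the function names are illustrative.

```python
def theoretical_peak_gflops(freq_ghz: float,
                            instr_per_cycle: float,
                            flops_per_instr: float) -> float:
    """Frequency x instructions issued per cycle x FLOPs per instruction."""
    return freq_ghz * instr_per_cycle * flops_per_instr


def fraction_of_theory(measured_gflops: float, theory_gflops: float) -> float:
    """How close a measured peak comes to the theoretical one."""
    return measured_gflops / theory_gflops


# Hypothetical core: 2.0 GHz, 2 fmla per cycle, 8 FLOPs per 128-bit fp32 fmla.
peak = theoretical_peak_gflops(2.0, 2, 8)

# Hypothetical measurement, standing in for a number reported by a benchmark.
ratio = fraction_of_theory(30.4, peak)
print(f"theory: {peak:.1f} GFLOPS, measured/theory: {ratio:.2f}")
```

A ratio well below 1 suggests that one of the three factors was assumed wrongly, for example the number of instructions the core can issue per cycle.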
Drawing an instruction-specific Roofline model

The Roofline model is widely used in high-performance computing and is an important tool for judging whether an algorithm can be optimized further and in which direction. With MegPeak we can draw a more specific Roofline model for a given instruction. For example, on a CPU the memory bandwidth does not change with data type, but the compute peak differs greatly: on Arm, the float compute peak and the int8 compute peak are different.
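A Roofline curve is the minimum of two limits: the compute peak, and memory bandwidth multiplied by computational intensity. The sketch below shows how one shared bandwidth and two different compute peaks give two different roofs. The bandwidth and peak values are placeholders chosen for illustration, not measurements from the article.

```python
def attainable_gflops(intensity: float,
                      peak_gflops: float,
                      bandwidth_gbs: float) -> float:
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Placeholder numbers: same memory bandwidth, different per-type compute peaks.
BANDWIDTH_GBS = 10.0
PEAKS_GFLOPS = {"fp32": 32.0, "int8": 128.0}

# The ridge point is the intensity at which the two limits meet.
ridge = {dtype: peak / BANDWIDTH_GBS for dtype, peak in PEAKS_GFLOPS.items()}

for dtype, peak in PEAKS_GFLOPS.items():
    points = [attainable_gflops(i, peak, BANDWIDTH_GBS) for i in (1, 4, 16)]
    print(dtype, "ridge at", ridge[dtype], "FLOPs/byte, roof samples:", points)
```

With these placeholder values the int8 roof keeps rising at intensities where the fp32 roof is already flat, which is why a per-instruction Roofline is more informative than a single one per device.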
Assessing the optimization headroom of code
When optimizing a specific algorithm, MegPeak can measure the peak of the main instruction used inside the kernel. For example, when optimizing fp32 Matmul on Arm, the main instruction is fmla. We can then run the program and measure the peak it actually reaches. The smaller the gap between the instruction peak and the program's peak, the better the code is optimized.
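As a rough illustration of this comparison, the sketch below converts a Matmul's shape and run time into achieved GFLOPS (counting 2·M·N·K floating-point operations) and divides by the instruction peak. The run time and the 32 GFLOPS fmla peak are hypothetical values, not figures from the article.

```python
def matmul_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved GFLOPS of an MxK by KxN Matmul: one multiply + one add per MAC."""
    return 2.0 * m * n * k / seconds / 1e9


def efficiency(kernel_gflops: float, instr_peak_gflops: float) -> float:
    """Fraction of the main instruction's peak that the kernel reaches."""
    return kernel_gflops / instr_peak_gflops


# Hypothetical run: a 1024^3 Matmul taking 0.1 s on a core whose
# measured fmla peak is assumed to be 32 GFLOPS.
achieved = matmul_gflops(1024, 1024, 1024, 0.1)
eff = efficiency(achieved, 32.0)
print(f"achieved {achieved:.2f} GFLOPS, {eff:.0%} of the fmla peak")
```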
In addition, we can work out the amount of computation and memory access from the algorithm, draw the Roofline above with MegPeak, compute the actual computational intensity, and locate it on the Roofline. If the intensity falls in the green region, the program needs to focus on optimizing memory access and use a better access pattern, such as blocking or packing data in advance. If the intensity falls in the gray region, the code is already near its best. To go faster, the only option left is to optimize at the algorithm level, for example by using FFT or Winograd in convolution.
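The steps in this paragraph can be sketched for fp32 Matmul. The example assumes the ideal case where each matrix is moved between memory and the core exactly once, and it reuses placeholder peak and bandwidth values; both are assumptions made for illustration only.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte, assuming A, B and C each cross the memory bus once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


def bound(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> str:
    """Classify a kernel by comparing its intensity with the ridge point."""
    ridge = peak_gflops / bandwidth_gbs
    return "memory-bound" if intensity < ridge else "compute-bound"


# Placeholder machine: 32 GFLOPS fp32 peak, 10 GB/s memory bandwidth.
square = matmul_intensity(1024, 1024, 1024)   # large square Matmul
skinny = matmul_intensity(1, 1024, 1024)      # vector-matrix product

print(f"{square:.1f} FLOPs/byte ->", bound(square, 32.0, 10.0))
print(f"{skinny:.2f} FLOPs/byte ->", bound(skinny, 32.0, 10.0))
```

A real kernel moves more data than this ideal count, so its actual intensity is lower. Blocking and packing are ways of pulling it back toward the ideal.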
Exploring the optimal combination of assembly instructions
Many kernel optimizations cannot be measured with a single instruction. It takes a combination of several instructions to represent the computation of the whole kernel, so we need to explore how to arrange these instructions to get the best performance from the processor. The example below comes from optimizing fp32 Matmul on the A53 little core. Because Matmul is a compute-intensive operator, the goal is to hide the overhead of memory access instructions behind computation, using MegPeak to analyze how to combine instructions so that as many as possible are issued together.
Because the little core has limited resources, multi-issue is subject to many restrictions:
First, use MegPeak to measure the compute peak of the fp32 fmla instruction on A53, and define it as 100% of peak compute performance.
Test which assembly instructions can be dual-issued:
Add code to MegPeak that combines vector load and fmla at a 1:1 ratio. The measured peak is only 36% of the float peak, which shows that vector load and fmla cannot be dual-issued.
Likewise, the general-purpose register load instruction ldr combined with fmla reaches 93% of the float peak, which shows that ldr can be dual-issued with fmla.
In the same way, ins + fmla can be dual-issued, and ins + 64-bit vector load can be dual-issued.
Consider how the innermost Matmul kernel computes. If the innermost kernel's block size is 8x12, each innermost iteration needs to read 20 float values and perform 24 fmla operations.
Combining this with the MegPeak measurements above, we need to find the assembly sequence that completes these 20 float loads and 24 fmla computations in the fewest clock cycles. The data loads should therefore be dual-issued with fmla as much as possible, to hide the time spent loading data.
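The 20 loads and 24 fmla operations follow directly from the block size. A minimal sketch, assuming 4 fp32 lanes per 128-bit NEON register and one step along the K dimension per iteration (the function name is illustrative):

```python
def inner_kernel_counts(mr: int, nr: int, lanes: int = 4) -> tuple[int, int]:
    """Per K-step work of an mr x nr Matmul micro-kernel.

    Returns (floats to load, vector fmla instructions to issue).
    """
    floats_loaded = mr + nr              # mr values of A and nr values of B
    multiply_adds = mr * nr              # one MAC per element of the output block
    fmla_count = -(-multiply_adds // lanes)  # ceiling division by lane count
    return floats_loaded, fmla_count


loads, fmlas = inner_kernel_counts(8, 12)
print(f"8x12 block: {loads} float loads, {fmlas} fmla per K-step")
```

Larger blocks raise the ratio of fmla to loaded floats, but they are limited by the number of vector registers available to hold the output block.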
The final instruction combination is:
Use a 64-bit vector load + ldr + ins to assemble the data of one NEON register. Because ldr and ins can both be dual-issued with fmla, issuing them together with fmla hides their cost.
Interleave these 3 instructions with fmla instructions, resolving data dependencies as far as possible.
With this instruction combination, Matmul reaches about 70% of the compute peak on the little core.
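The dual-issue tests in this section can be read with a simple rule of thumb. It is offered as an assumption, not as MegPeak's own logic: if a 1:1 mix of some instruction with fmla cannot be dual-issued, every fmla has to share issue cycles with its partner, so the mix cannot exceed roughly 50% of the fmla peak. A mix that clearly exceeds 50% must overlap the two instructions.

```python
def can_dual_issue(mix_fraction_of_peak: float, threshold: float = 0.5) -> bool:
    """Heuristic: a 1:1 mix above ~50% of the fmla peak implies overlap.

    The 0.5 threshold is an assumption based on one issue slot per
    instruction; it is not a value taken from MegPeak.
    """
    return mix_fraction_of_peak > threshold


# Fractions of the fp32 fmla peak reported in the text for A53.
mixes = {"vector load + fmla": 0.36, "ldr + fmla": 0.93}
verdicts = {name: can_dual_issue(frac) for name, frac in mixes.items()}
print(verdicts)
```

Values near the threshold are ambiguous and would need a closer look, for example by varying the instruction ratio.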
Summary
As an auxiliary tool for high-performance computing, MegPeak lets developers easily obtain internal details of a target processor, and it helps with evaluating code performance and designing optimization strategies. There are still areas where MegPeak needs to be extended:
1. Support more processor performance data, such as L1 and L2 cache sizes and automatic discovery of which assembly instructions can be dual-issued, and draw a rough sketch of the processor backend, for example:
https://en.wikichip.org/w/images/5/57/cortex-a76_block_diagram.svg
2. Support measuring more details of mobile OpenCL, such as warp size and local memory size.
If you are interested in these features, contributions are welcome. Finally, you are welcome to use MegPeak.