Recommended open source tools: MegPeak, a high-performance computing tool
2022-07-30 14:29:00 【PaperWeekly】

Against the backdrop of exploding demand for compute, getting the most out of existing hardware has become very important. Put simply, we need to push the performance of existing algorithms on a given processor as far as possible, to meet the heavy compute demands of current AI algorithms.
To push performance optimization to the limit, we can:
Optimize the algorithm, so that it needs as little memory access and computation as possible while still meeting the accuracy requirement
Optimize the program, so that the code implementing the algorithm extracts the maximum performance from the processor
When optimizing a program, the first question to answer is what fraction of the processor's peak compute the program actually achieves, and from there how much optimization headroom remains and in which direction to optimize.
To understand our processors better, the Megvii MegEngine team developed a tool, MegPeak, to help developers with performance evaluation, development guidance, and so on. It is now open source.
GitHub project:
https://github.com/MegEngine/MegPeak
MegPeak features
With MegPeak, users can measure the following on the target processor:
Peak instruction throughput
Instruction latency
Memory bandwidth
Throughput of arbitrary instruction combinations
Some of this information can be found in the chip's data sheet and combined with theoretical calculation, but in many cases detailed performance documentation for the target processor is not available. Measuring with MegPeak is also more direct and accurate, and it can test the throughput of specific assembly instructions. The MegEngine team ran MegPeak on several commonly used ARM CPUs and compiled the results for the fmla instruction into the table below.

Here, GFLOPS measures the device's compute capability, while FLOPS/Cycle helps infer the CPU's hardware characteristics. The A55, A77, and Apple M1 are used as examples below.
A55: each fmla instruction performs two floating-point operations (one multiply and one add) on each fp32 lane, so a 128-bit fmla with four lanes performs 8. The measured FLOPS/Cycle is close to 8, so the A55 back end presumably has either one 128-bit vector floating-point multiply-add unit or two 64-bit ones.
A77: its FLOPS/Cycle is about 16, so the A77 can execute 2 SIMD fmla instructions per cycle. Its back end therefore has two SIMD fmla execution units and is at least dual-issue.
Apple M1: its FLOPS/Cycle reaches 32, which indicates 4 SIMD execution units.
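The step from FLOPS/Cycle to a count of fmla instructions per cycle can be written out as a small calculation. This is a minimal sketch, assuming 128-bit vectors of fp32 data and the rounded FLOPS/Cycle figures quoted above; the function name is made up for illustration and is not part of MegPeak.

```python
# Back-of-envelope check of the FLOPS/Cycle reasoning above.
# Assumptions (not MegPeak output): 128-bit vector fmla on fp32 data,
# and the rounded FLOPS/Cycle values quoted in the text.

VECTOR_BITS = 128
FP32_BITS = 32
FLOPS_PER_LANE = 2  # one multiply + one add per lane


def fmla_per_cycle(flops_per_cycle: float) -> float:
    """Return how many 128-bit fp32 fmla instructions per cycle explain the figure."""
    lanes = VECTOR_BITS // FP32_BITS          # 4 fp32 lanes per vector
    flops_per_fmla = lanes * FLOPS_PER_LANE   # 8 FLOPs per instruction
    return flops_per_cycle / flops_per_fmla


quoted = {"A55": 8, "A77": 16, "Apple M1": 32}
for cpu, flops in quoted.items():
    print(f"{cpu}: about {fmla_per_cycle(flops):.0f} fmla per cycle")
```

A result of 1 matches the single 128-bit unit (or two 64-bit units) inferred for the A55, 2 matches the two units inferred for the A77, and 4 matches the Apple M1.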
What can the data measured with MegPeak be used for?
MegPeak can measure the processor's memory bandwidth, the peak compute of its instructions, instruction latency, and similar information, so it can help us:
Draw a Roofline Model to guide performance optimization of our models
Evaluate how much room is left to optimize a program
Explore the theoretical peak compute of assembly instructions
MegPeak can also be used to validate theory. Suppose we compute a theoretical peak as processor frequency × number of instructions issued per cycle on a single core × amount of computation per instruction; we can then verify it against an actual MegPeak measurement.
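The formula above can be sketched in a few lines. The frequency, issue width, and "measured" number below are placeholder values chosen only to make the arithmetic visible; they are not figures from the article or from any MegPeak run, and the function names are hypothetical.

```python
# Theoretical peak = frequency x instructions issued per cycle x work per instruction.
# All numbers below are illustrative placeholders, not measurements.


def theoretical_peak_gflops(freq_ghz: float,
                            instr_per_cycle: float,
                            flops_per_instr: float) -> float:
    """Single-core theoretical peak in GFLOPS."""
    return freq_ghz * instr_per_cycle * flops_per_instr


def relative_gap(theoretical: float, measured: float) -> float:
    """Fraction of the theoretical peak that the measurement falls short by."""
    return (theoretical - measured) / theoretical


# Hypothetical core: 2.0 GHz, 2 fmla per cycle, 8 FLOPs per 128-bit fp32 fmla.
peak = theoretical_peak_gflops(2.0, 2, 8)
measured = 31.0  # placeholder standing in for a measured value
print(f"theoretical: {peak} GFLOPS, gap: {relative_gap(peak, measured):.1%}")
```

A small gap supports the assumed issue width and per-instruction work; a large gap suggests one of the three factors in the formula is wrong for that processor.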
Drawing an instruction-specific Roofline Model

The Roofline Model is widely used in high-performance computing and is an important tool for judging whether an algorithm can be optimized and in which direction. With MegPeak, a more specific Roofline model can be drawn for a particular instruction. For example, on a CPU the memory bandwidth does not change across data types, but peak compute differs greatly, as with the float peak versus the int8 peak on Arm.
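The Roofline bound itself is a single expression: attainable performance is the smaller of the compute peak and the bandwidth multiplied by compute intensity. The sketch below uses made-up peak and bandwidth numbers (not MegPeak results) to show how two data types share the same bandwidth slope but hit different ceilings.

```python
# Roofline: attainable = min(compute peak, memory bandwidth * compute intensity).
# Peaks and bandwidth are illustrative placeholders, not measured values.


def attainable_gops(intensity: float, peak_gops: float, bandwidth_gbs: float) -> float:
    """Upper bound on performance (GOPS) at a given intensity (ops per byte)."""
    return min(peak_gops, bandwidth_gbs * intensity)


BANDWIDTH_GBS = 16.0                     # same for every data type
PEAKS = {"float": 32.0, "int8": 128.0}   # hypothetical per-type compute peaks

for dtype, peak in PEAKS.items():
    ridge = peak / BANDWIDTH_GBS         # intensity where the roof flattens
    points = [attainable_gops(i, peak, BANDWIDTH_GBS) for i in (1.0, 4.0, 16.0)]
    print(f"{dtype}: ridge at {ridge} ops/byte, roof at 1/4/16 ops/byte = {points}")
```

Replacing the placeholders with the measured bandwidth and the measured peak for the instruction of interest gives the instruction-specific roofline described in the text.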
Evaluating how much room is left to optimize code
When optimizing a specific algorithm, MegPeak can measure the peak of the main instruction used inside the kernel. For example, when optimizing fp32 Matmul on Arm, the main instruction is fmla. You can then run the program and measure the peak it actually achieves: the smaller the gap between the instruction peak and the program's measured peak, the better optimized the code is.
In addition, you can compute the algorithm's amount of computation and memory traffic, draw the Roofline above with MegPeak, calculate the actual compute intensity, and locate it on the Roofline. If the intensity falls in the green area (the memory-bound region), the program should focus on optimizing memory access by using a better access pattern, such as tiling or packing data ahead of time. If the intensity falls in the gray area (the compute-bound region), the code is already at its best; to go faster, the only option is to optimize at the algorithm level, for example by using FFT or Winograd in convolution.
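As a worked example of computing the compute intensity, the sketch below does it for an fp32 Matmul and compares the result with a roofline ridge point. It assumes the ideal case where every element of A, B, and C crosses the memory bus exactly once, and it uses a placeholder ridge point of 2.0 FLOPs per byte; real traffic is usually higher, so treat the result as an upper bound on intensity. The function names are hypothetical.

```python
# Compute intensity of an M x K by K x N fp32 Matmul, compared to a ridge point.
# Assumption: each element of A, B and C is moved to or from memory exactly once.

FP32_BYTES = 4


def matmul_intensity(m: int, n: int, k: int) -> float:
    """FLOPs per byte for an ideal fp32 Matmul."""
    flops = 2 * m * n * k                             # one multiply + one add per MAC
    bytes_moved = FP32_BYTES * (m * k + k * n + m * n)
    return flops / bytes_moved


def bound(intensity: float, ridge: float) -> str:
    """Classify a kernel relative to the roofline ridge point."""
    return "memory-bound" if intensity < ridge else "compute-bound"


RIDGE = 2.0  # placeholder: peak GFLOPS divided by bandwidth in GB/s

for shape in [(256, 256, 256), (256, 1, 256)]:
    ai = matmul_intensity(*shape)
    print(f"M,N,K={shape}: {ai:.2f} FLOPs/byte -> {bound(ai, RIDGE)}")
```

The square Matmul lands well to the right of the ridge, while the matrix-vector shape lands to the left, which is the case where tiling and packing matter most.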
Exploring the optimal combination of assembly instructions
Many kernel optimizations cannot be measured by a single instruction; it takes a combination of several instructions to represent the computation of the whole kernel, so we need to explore how to arrange these instructions to get the best performance from the processor. The example below comes from optimizing fp32 Matmul on the A53 little core. Because Matmul is a compute-intensive operator, the goal is to hide the overhead of memory-access instructions through multi-issue, and MegPeak is used to analyze how to combine instructions so that as many as possible are issued together.
Because the little core has limited resources, multi-issue is subject to many restrictions:
First, use MegPeak to measure the peak compute of the fp32 fmla instruction on the A53 and define it as 100% peak compute performance.
Test which assembly instructions support dual issue:
Add code to MegPeak for a 1:1 combination of vector load and fmla. The measured peak is only 36% of the float peak, which shows that vector load and fmla cannot be dual-issued
The combination of the general-purpose register load instruction ldr with fmla reaches 93% of the float peak, which shows that ldr can be dual-issued with fmla
In the same way, ins + fmla can be dual-issued, and ins + 64-bit vector load can be dual-issued
Consider how the innermost Matmul kernel computes. If the innermost kernel's tile size is 8x12, the innermost loop needs to read 20 float values and perform 24 fmla operations.
Combining this with the MegPeak measurements above, we need to find the instruction sequence that completes the loads of these 20 float values and the 24 fmla computations in the fewest clock cycles. That means dual-issuing data loads with fmla as much as possible to hide the cost of the loads.
The final assembly instruction combination is:
Use a 64-bit vector load + ldr + ins to assemble the data for one neon register. Because ldr and ins can both be dual-issued with fmla, issuing them together with fmla hides their cost
Interleave fmla instructions among these 3 instructions, and resolve data dependencies as far as possible
With the instruction combination above, Matmul reaches about 70% of peak compute on the little core.
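The load and fmla counts quoted for the 8x12 tile can be reproduced with a short calculation. This sketch assumes that each step along the K dimension loads one column of the A tile and one row of the B tile and accumulates their outer product, and that one fmla covers 4 fp32 lanes; the article does not spell out these details, so they are assumptions, and the function name is hypothetical.

```python
# Per-K-step work for an M_TILE x N_TILE fp32 Matmul micro-kernel.
# Assumptions: outer-product accumulation, 4 fp32 lanes per fmla.


def tile_work(m_tile: int, n_tile: int, lanes: int = 4) -> tuple[int, int]:
    """Return (floats loaded, fmla instructions) for one step along K."""
    macs = m_tile * n_tile                 # multiply-accumulates in the outer product
    if macs % lanes != 0:
        raise ValueError("tile does not divide evenly into vector lanes")
    loads = m_tile + n_tile                # one column of A plus one row of B
    return loads, macs // lanes


loads, fmlas = tile_work(8, 12)
print(f"8x12 tile: {loads} float loads, {fmlas} fmla per K step")
print(f"fmla per loaded float: {fmlas / loads:.2f}")
```

With more fmla instructions than loaded floats per step, pairing each load-side instruction with an fmla is what allows the load cost to be hidden.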
Summary
As an auxiliary tool for high-performance computing, MegPeak lets developers easily obtain internal details of the target processor, and helps with evaluating code performance and designing optimization approaches. However, there are still areas where MegPeak needs to be enriched:
1. Support more processor performance data, such as L1 and L2 cache sizes, automatic discovery of which assembly instructions can be dual-issued, and a rough sketch of the processor back end, such as:
https://en.wikichip.org/w/images/5/57/cortex-a76_block_diagram.svg
2. Support measuring more details of mobile OpenCL, such as warp size and local memory size.
If you are interested in any of the features above, you are welcome to submit code. Finally, everyone is welcome to use MegPeak.








