Recommended open source tools: MegPeak, a high-performance computing tool
2022-07-30 14:29:00 【PaperWeekly】

With demand for compute exploding, making the most of existing hardware has become very important. Put simply, we need to optimize existing algorithms to the limit for a specific processor, so as to meet the high compute demands of current AI algorithms as far as possible.
To push performance optimization to the limit, we can:
Optimize the algorithm, so that it needs as little memory access and computation as possible while still meeting the accuracy requirement
Optimize the program, so that the code implementing these algorithms extracts the processor's full performance
When optimizing a program, the first question to answer is how to evaluate what fraction of the processor's compute capability the program actually achieves, and from that how much room for optimization remains and in which direction to optimize.
To better understand our processors, Megvii's MegEngine team developed MegPeak, a tool that helps developers with performance evaluation and development guidance. It is now open source.
GitHub project:
https://github.com/MegEngine/MegPeak
MegPeak features
With MegPeak, users can measure the following on a target processor:
Peak instruction throughput
Instruction latency
Memory bandwidth
Throughput of arbitrary instruction combinations
Some of this information can be looked up in the chip's data sheet and then derived by theoretical calculation, but in many cases detailed performance documentation for the target processor is not available. Measuring with MegPeak is also more direct and accurate, and it can measure the throughput of specific instruction combinations. The MegEngine team ran MegPeak on several commonly used ARM CPUs and compiled the results for the fmla instruction into the table below.

In the table, GFLOPS measures the device's compute capability, while FLOPS/Cycle helps infer the CPU's hardware characteristics. A55, A77, and Apple M1 are taken as examples below.
A55: each fmla instruction performs two floating-point operations (one multiply and one add) per vector element, and the measured FLOPS/Cycle is close to 8. The A55 back end therefore presumably has either one 128-bit floating-point vector multiply-add unit or two 64-bit ones.
A77: its FLOPS/Cycle is about 16, so the A77 can execute 2 SIMD fmla instructions per cycle. Its back end therefore has two SIMD fmla execution units and is at least dual-issue.
Apple M1: its FLOPS/Cycle reaches 32, which indicates 4 SIMD execution units.
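The inference from FLOPS/Cycle to execution units can be written out as a small calculation. This is only a sketch under stated assumptions: 128-bit vectors, fp32 elements, and a result rounded to a whole number of units. The function names are made up for illustration and are not part of MegPeak.

```python
def flops_per_fmla(vector_bits: int = 128, element_bits: int = 32) -> int:
    """FLOPs performed by one SIMD fmla: one multiply and one add per lane."""
    lanes = vector_bits // element_bits
    return 2 * lanes


def estimated_fmla_units(flops_per_cycle: float,
                         vector_bits: int = 128,
                         element_bits: int = 32) -> int:
    """Estimate how many full-width fmla instructions complete per cycle.

    Assumption: the measured FLOPS/Cycle comes from a pure fmla loop, so
    dividing by the FLOPs of one instruction gives the unit count.
    """
    return round(flops_per_cycle / flops_per_fmla(vector_bits, element_bits))


# FLOPS/Cycle values quoted in the article for each core.
for name, measured in [("A55", 8), ("A77", 16), ("Apple M1", 32)]:
    print(name, estimated_fmla_units(measured))
```

With the figures quoted above, this gives 1, 2, and 4 full-width units for A55, A77, and Apple M1 respectively. For the A55 a single 128-bit unit is indistinguishable here from two 64-bit units, which is why the article leaves both possibilities open.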
What can the data measured with MegPeak be used for?
MegPeak can measure the processor's memory bandwidth, the theoretical compute peak of instructions, instruction latency, and similar information. This helps us:
Draw a Roofline Model to guide performance optimization of our models
Evaluate how much room for optimization a program has
Explore the theoretical compute peak of instruction combinations
MegPeak can also be used to validate theory. If we compute the theoretical compute peak as processor frequency × instructions issued per cycle on a single core × amount of computation per instruction, we can then verify that figure against an actual MegPeak measurement.
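The formula in the previous paragraph is easy to check numerically. The sketch below is illustrative only: the 2.0 GHz clock, the issue width of 2, and the measured value are hypothetical placeholders, not MegPeak results.

```python
def theoretical_peak_gflops(freq_ghz: float,
                            instrs_per_cycle: int,
                            flops_per_instr: int) -> float:
    """Peak = frequency x instructions issued per cycle x FLOPs per instruction."""
    return freq_ghz * instrs_per_cycle * flops_per_instr


def achieved_fraction(measured_gflops: float, theoretical_gflops: float) -> float:
    """Share of the theoretical peak that a measurement reaches."""
    return measured_gflops / theoretical_gflops


# Hypothetical core: 2.0 GHz, two 128-bit fp32 fmla per cycle (8 FLOPs each).
peak = theoretical_peak_gflops(freq_ghz=2.0, instrs_per_cycle=2, flops_per_instr=8)
measured = 31.0  # placeholder standing in for a measured value
print(f"theoretical {peak:.1f} GFLOPS, measured reaches {achieved_fraction(measured, peak):.0%}")
```

A measured value far below the theoretical one would suggest that one of the assumed inputs (frequency, issue width, or FLOPs per instruction) is wrong for that processor.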
Drawing an instruction-specific Roofline Model

The Roofline Model is widely used in high-performance computing and is an important tool for evaluating whether an algorithm can be optimized and in which direction. With MegPeak one can draw a more specific Roofline model for a particular instruction. For example, on a CPU, different data types leave the memory bandwidth unchanged but have very different compute peaks, such as the float peak versus the int8 peak on Arm.
Evaluating how much room for optimization code has
When optimizing a specific algorithm, MegPeak can measure the peak of the main instruction inside the kernel. For example, when optimizing fp32 Matmul on Arm, the main instruction is fmla. One can then measure the peak the program actually reaches at run time. The smaller the gap between the instruction's peak and the program's peak, the better optimized the code is.
In addition, from the algorithm one can compute the amount of computation and the amount of memory access, and use MegPeak to draw the Roofline above. Compute the actual compute intensity and locate it on the Roofline. If the intensity falls in the green region, the program needs to focus more on optimizing memory access and should use a better access pattern, for example blocking or packing data in advance. If the intensity falls in the gray region, the code is already close to optimal, and further speedup can only come from optimizing the algorithm itself, for example by using FFT or Winograd in convolution.
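The Roofline reasoning above can be sketched as a minimal calculation, assuming the standard Roofline formulation: attainable performance is the smaller of the compute peak and the product of compute intensity and memory bandwidth. The peak and bandwidth figures below are hypothetical placeholders for values MegPeak would measure.

```python
def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Compute intensity (FLOPs/byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gbs


def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: limited by memory traffic or by the compute peak."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def limiting_factor(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> str:
    """Say which roof a kernel with this compute intensity sits under."""
    if intensity < ridge_point(peak_gflops, bandwidth_gbs):
        return "memory-bound"
    return "compute-bound"


PEAK_GFLOPS = 32.0    # hypothetical instruction peak
BANDWIDTH_GBS = 16.0  # hypothetical memory bandwidth

for intensity in (0.5, 4.0):
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    kind = limiting_factor(intensity, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"intensity {intensity} FLOPs/byte: at most {bound} GFLOPS ({kind})")
```

A kernel reported as memory-bound is the case where blocking and data packing help. A compute-bound kernel already running near the peak can only be sped up by reducing the amount of computation.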
Exploring the optimal instruction combination
Many kernels cannot be characterized by a single instruction. It takes a combination of several instructions to represent the whole kernel's computation, so we need to explore how to arrange these instructions to get the best performance out of the processor. As an example, consider optimizing fp32 Matmul on the A53 little core. Because Matmul is a compute-intensive operator, we consider hiding the overhead of memory-access instructions through multiple issue, and use MegPeak to analyze how to combine instructions so that as many as possible are issued together.
Because the little core has limited resources, multiple issue is subject to many restrictions:
First, use MegPeak to measure the compute peak of the fp32 fmla instruction on the A53, and define it as 100% peak compute performance.
Test which instruction combinations can be dual-issued:
Add code to MegPeak for a 1:1 combination of vector load and fmla. The measured peak is only 36% of the float peak, which shows that vector load and fmla cannot be dual-issued
The combination of the general-purpose register load instruction ldr with fmla reaches 93% of the float peak, which shows that ldr can be dual-issued with fmla
In the same way, measurement shows that ins + fmla can be dual-issued, and that ins + 64-bit vector load can be dual-issued
Consider how the innermost Matmul kernel computes. If its block size is 8x12, each iteration of the innermost loop needs to read 20 float values and perform 24 fmla operations.
Combining this with the MegPeak measurements above, we need to find the instruction combination that completes these 20 float loads and 24 fmla computations in the fewest clock cycles. The data loads should therefore be dual-issued with fmla as much as possible, hiding the time spent on loading.
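The figures of 20 floats and 24 fmla for an 8x12 block can be reproduced with a short calculation. This is a sketch under assumptions the article does not spell out: each step along the reduction dimension loads one 8-element slice of one input matrix and one 12-element slice of the other, and each fmla works on 4 fp32 lanes of a 128-bit register. The function name is illustrative only.

```python
def inner_step_costs(m_block: int = 8, n_block: int = 12, lanes: int = 4) -> tuple[int, int]:
    """Return (floats loaded, fmla instructions) for one step of the inner loop.

    Assumed model: an m_block x n_block tile of outputs stays in registers,
    and each step adds the outer product of an m_block-element slice and an
    n_block-element slice to it.
    """
    floats_loaded = m_block + n_block          # 8 + 12 input values
    fmla_count = (m_block * n_block) // lanes  # 96 multiply-adds, 4 per fmla
    return floats_loaded, fmla_count


floats_loaded, fmla_count = inner_step_costs()
print(f"{floats_loaded} floats loaded, {fmla_count} fmla per step")
```

Under this model, the 20 loaded floats fill 5 full 128-bit registers, so the loads are few compared with the 24 fmla instructions. That ratio is what makes it plausible to hide the loads behind fmla through dual issue.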
The final instruction combination is:
Use a 64-bit vector load + ldr + ins to assemble the data for one neon register. Because both ldr and ins can be dual-issued with fmla, issuing them together with fmla hides their time
Interleave fmla instructions among these 3 instructions, and resolve data dependencies as far as possible
With the instruction combination above, Matmul reaches about 70% of the compute peak on the little core.
Summary
As an auxiliary tool for high-performance computing, MegPeak lets developers easily obtain internal details of the target processor, which helps them evaluate code performance and design optimization approaches. There are still some areas where MegPeak needs to be enriched:
1. Support more processor performance data, such as L1 and L2 cache sizes, automatically discover which instruction combinations can be dual-issued, and draw an approximate sketch of the processor back end, such as:
https://en.wikichip.org/w/images/5/57/cortex-a76_block_diagram.svg
2. Support measuring more details of OpenCL on mobile devices, such as warp size and local memory size.
If you are interested in these features, contributions are welcome. Finally, everyone is welcome to use MegPeak.