NEON Optimization: A Matrix Transpose Instruction Optimization Case
2022-07-07 01:14:00 【To know】
NEON Optimization series:
- NEON Optimization 1: Software performance optimization and how to reduce power consumption
- NEON Optimization 2: Summary of frequently used ARM optimization instructions
- NEON Optimization 3: Matrix transpose instruction optimization case
- NEON Optimization 4: Optimization case for the floor/ceil functions
- NEON Optimization 5: Optimization case for the log10 function
- NEON Optimization 6: Interleaved and de-interleaved access
- NEON Optimization 7: Summary of performance optimization experience
- NEON Optimization 8: Performance optimization FAQ
Background
Transposition is used frequently in matrix computation. This article summarizes a NEON optimization case for transposing the basic 4x4 matrix.
The original C function transposes M[4][4] into MT[4][4], with the following effect:
M[4][4]:
- a1, b1, c1, d1
- a2, b2, c2, d2
- a3, b3, c3, d3
- a4, b4, c4, d4
M[4][4]^T:
- a1, a2, a3, a4
- b1, b2, b3, b4
- c1, c2, c3, c4
- d1, d2, d3, d4
Optimization idea
There are two ways to transpose the matrix with parallel (SIMD) instructions.
- Method 1
  - Uses 4 kinds of instructions: vld4q_f32/vtrnq_f32/vuzpq_f32/vst4q_f32
  - vld4q_f32 first loads the data from memory into registers with a de-interleaving read
  - vtrnq_f32 uses the in-register 2x2 transpose to transpose part of the rows and columns
  - vuzpq_f32 uses de-interleaving to regroup part of the rows and columns
  - The resulting registers are then assigned to the appropriate rows
  - vst4q_f32 writes the register contents back to memory
- Method 2
  - Exploits the interleaved access between memory and registers, so two kinds of instructions are enough to transpose and no data is shuffled inside the registers
  - vld1q_f32 reads the data into registers row by row
  - vst4q_f32 interleaves the register values as it writes them to memory
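Why method 2 works can be seen from the index mapping of the interleaving store. The following is a minimal plain-C sketch, not NEON code: `model_vst4q` and `model_transpose_method2` are hypothetical helpers that only mimic the lane order of `vst4q_f32` (lane i of register k is written to `dst[4*i + k]`). It runs on any host and is meant purely to illustrate the data movement.

```c
#include <string.h>

/* Hypothetical scalar model of vst4q_f32: four 4-lane registers are
 * written interleaved, so lane i of register k lands at dst[4*i + k]. */
static void model_vst4q(float dst[16], float regs[4][4])
{
    for (int i = 0; i < 4; i++) {        /* lane index */
        for (int k = 0; k < 4; k++) {    /* register index */
            dst[4 * i + k] = regs[k][i];
        }
    }
}

/* Method 2 in model form: copy each row into one "register" (the role of
 * vld1q_f32), then let the interleaving store perform the transpose. */
static void model_transpose_method2(const float m[16], float mt[16])
{
    float regs[4][4];
    for (int k = 0; k < 4; k++) {
        memcpy(regs[k], &m[4 * k], 4 * sizeof(float)); /* row k -> register k */
    }
    model_vst4q(mt, regs);
}
```

Because register k holds row k, the store places `M[k][i]` at `MT[i][k]`, which is exactly the transpose. The same mapping read in the other direction describes the de-interleaving load `vld4q_f32`.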
Sample code
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initialize the input matrix and clear the output matrix
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM] = {0};

    // Goal:
    // a1 b1 c1 d1 => a1 a2 a3 a4
    // a2 b2 c2 d2 => b1 b2 b3 b4
    // ...
    // a4 b4 c4 d4 => d1 d2 d3 d4

    // Original scalar version
    int32_t i, j;
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            MT[j][i] = M[i][j];
        }
    }
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;   // clear MT for the next version
        }
        printf("\n");
    }

    // Method 1
    // De-interleaving load: val[0]: a1 a2 a3 a4, val[1]: b1 b2 b3 b4, ... (the columns of M)
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);
    // val[0]: a1 b1 a3 b3, val[1]: a2 b2 a4 b4
    float32x4x2_t vf32x4x2fTmpABCD01 = vtrnq_f32(vf32x4x4fTmpABCD.val[0], vf32x4x4fTmpABCD.val[1]);
    // val[0]: c1 d1 c3 d3, val[1]: c2 d2 c4 d4
    float32x4x2_t vf32x4x2fTmpABCD23 = vtrnq_f32(vf32x4x4fTmpABCD.val[2], vf32x4x4fTmpABCD.val[3]);
    // Group rows 0 and 2: val[0]: a1 a3 c1 c3, val[1]: b1 b3 d1 d3
    float32x4x2_t vf32x4x2fTmpABCD02 = vuzpq_f32(vf32x4x2fTmpABCD01.val[0], vf32x4x2fTmpABCD23.val[0]);
    // Group rows 1 and 3: val[0]: a2 a4 c2 c4, val[1]: b2 b4 d2 d4
    float32x4x2_t vf32x4x2fTmpABCD13 = vuzpq_f32(vf32x4x2fTmpABCD01.val[1], vf32x4x2fTmpABCD23.val[1]);
    // val[0]: a1 b1 c1 d1, val[1]: a3 b3 c3 d3
    vf32x4x2fTmpABCD02 = vtrnq_f32(vf32x4x2fTmpABCD02.val[0], vf32x4x2fTmpABCD02.val[1]);
    // val[0]: a2 b2 c2 d2, val[1]: a4 b4 c4 d4
    vf32x4x2fTmpABCD13 = vtrnq_f32(vf32x4x2fTmpABCD13.val[0], vf32x4x2fTmpABCD13.val[1]);
    vf32x4x4fTmpABCD.val[0] = vf32x4x2fTmpABCD02.val[0]; // a1 b1 c1 d1
    vf32x4x4fTmpABCD.val[2] = vf32x4x2fTmpABCD02.val[1]; // a3 b3 c3 d3
    vf32x4x4fTmpABCD.val[1] = vf32x4x2fTmpABCD13.val[0]; // a2 b2 c2 d2
    vf32x4x4fTmpABCD.val[3] = vf32x4x2fTmpABCD13.val[1]; // a4 b4 c4 d4
    // Interleaving store: lane i of val[k] goes to MT[i][k], which yields the transpose
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;   // clear MT for the next version
        }
        printf("\n");
    }

    // Method 2
    float32x4x4_t vf32x4x4fTmp1ABCD;
    vf32x4x4fTmp1ABCD.val[0] = vld1q_f32(&M[0][0]); // a1 b1 c1 d1
    vf32x4x4fTmp1ABCD.val[1] = vld1q_f32(&M[1][0]); // a2 b2 c2 d2
    vf32x4x4fTmp1ABCD.val[2] = vld1q_f32(&M[2][0]); // a3 b3 c3 d3
    vf32x4x4fTmp1ABCD.val[3] = vld1q_f32(&M[3][0]); // a4 b4 c4 d4
    // The interleaving store writes the rows into MT as columns
    vst4q_f32(&MT[0][0], vf32x4x4fTmp1ABCD);
    printf("ver3:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }

    // Transpose register contents only, using a temporary array instead of the output matrix
    // float fTmpABCD4x4[4][4]; // temporary staging array
    // vst4q_f32(&fTmpABCD4x4[0][0], vf32x4x4fTmpABCD); // assume the data to transpose is in vf32x4x4fTmpABCD
    // vf32x4x4fTmpABCD.val[0] = vld1q_f32(&fTmpABCD4x4[0][0]);
    // vf32x4x4fTmpABCD.val[1] = vld1q_f32(&fTmpABCD4x4[1][0]);
    // vf32x4x4fTmpABCD.val[2] = vld1q_f32(&fTmpABCD4x4[2][0]);
    // vf32x4x4fTmpABCD.val[3] = vld1q_f32(&fTmpABCD4x4[3][0]); // the transposed result is back in the registers
    return 0;
}
The end of the code above includes a commented-out demo that transposes data already held in registers by staging it through a temporary array. Use it as the specific scenario requires.
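The register stage of method 1 (trn, uzp, trn, then reassignment) is itself a 4x4 transpose of whatever the four registers hold. A minimal plain-C sketch of that lane movement follows. `pair4_t`, `model_vtrnq`, `model_vuzpq` and `model_transpose_in_regs` are hypothetical names introduced here for illustration; they model the lane order of `vtrnq_f32` and `vuzpq_f32` with ordinary arrays and are not the intrinsics themselves.

```c
#include <string.h>

/* Hypothetical stand-in for float32x4x2_t: two 4-lane "registers". */
typedef struct { float val[2][4]; } pair4_t;

/* Model of vtrnq_f32: val[0] = {a0, b0, a2, b2}, val[1] = {a1, b1, a3, b3}. */
static pair4_t model_vtrnq(const float a[4], const float b[4])
{
    pair4_t r = {{{a[0], b[0], a[2], b[2]}, {a[1], b[1], a[3], b[3]}}};
    return r;
}

/* Model of vuzpq_f32: val[0] = {a0, a2, b0, b2}, val[1] = {a1, a3, b1, b3}. */
static pair4_t model_vuzpq(const float a[4], const float b[4])
{
    pair4_t r = {{{a[0], a[2], b[0], b[2]}, {a[1], a[3], b[1], b[3]}}};
    return r;
}

/* Same shuffle sequence as the register stage of method 1:
 * in[k] is register k on entry, out[k] is register k on exit. */
static void model_transpose_in_regs(float in[4][4], float out[4][4])
{
    pair4_t t01 = model_vtrnq(in[0], in[1]);
    pair4_t t23 = model_vtrnq(in[2], in[3]);
    pair4_t u02 = model_vuzpq(t01.val[0], t23.val[0]);
    pair4_t u13 = model_vuzpq(t01.val[1], t23.val[1]);
    pair4_t r02 = model_vtrnq(u02.val[0], u02.val[1]);
    pair4_t r13 = model_vtrnq(u13.val[0], u13.val[1]);
    memcpy(out[0], r02.val[0], sizeof out[0]);
    memcpy(out[2], r02.val[1], sizeof out[2]);
    memcpy(out[1], r13.val[0], sizeof out[1]);
    memcpy(out[3], r13.val[1], sizeof out[3]);
}
```

Feeding the rows 0..3, 4..7, 8..11, 12..15 into `model_transpose_in_regs` gives 0 4 8 12, 1 5 9 13, 2 6 10 14, 3 7 11 15, with no memory round trip involved.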
Summary
Method 1
- Transposes in registers, then writes the result to memory
- Register-only operation: 6 instructions plus 4 assignments
Method 2
- Implements the transpose directly through interleaved access between memory and registers, reducing the instruction count to 3
- Interleaved access between registers and memory: 5 instructions
All in all, practice shows that when the data only needs to be transposed within registers, method 1 is better: it is more efficient than a read-write round trip between memory and registers, even though it takes one or two more instructions. When the result has to be written to memory, method 2 is better.