NEON Optimization: A Matrix Transpose Instruction Optimization Case
2022-07-07 01:14:00 【To know】
NEON Optimization series:
- NEON Optimization 1: Software performance optimization and how to reduce power consumption, link
- NEON Optimization 2: Summary of frequently used ARM optimization instructions, link
- NEON Optimization 3: Matrix transpose instruction optimization case, link
- NEON Optimization 4: Optimization case for the floor/ceil functions, link
- NEON Optimization 5: Optimization case for the log10 function, link
- NEON Optimization 6: Interleaved and de-interleaved memory access, link
- NEON Optimization 7: Summary of performance optimization experience, link
- NEON Optimization 8: Performance optimization FAQ, link
Background
Transposition is a common step in matrix computation. This article summarizes NEON optimization cases for transposing the basic 4x4 building-block matrix.
The original C function transposes M[4][4] into MT[4][4], with the following effect:
M[4][4]:
- a1, b1, c1, d1
- a2, b2, c2, d2
- a3, b3, c3, d3
- a4, b4, c4, d4
M[4][4]^T:
- a1, a2, a3, a4
- b1, b2, b3, b4
- c1, c2, c3, c4
- d1, d2, d3, d4
Optimization idea
There are two ways to perform the transpose with parallel (SIMD) instructions.
- Method 1
- Uses four intrinsics: vld4q_f32/vtrnq_f32/vuzpq_f32/vst4q_f32
- vld4q_f32 first performs a de-interleaving load of the whole matrix from memory into four registers
- vtrnq_f32 transposes 2x2 blocks of elements between two registers, which rearranges part of the rows and columns
- vuzpq_f32 de-interleaves the elements of two registers, which rearranges the remaining rows and columns
- The resulting registers are then assigned to the rows they belong to
- vst4q_f32 writes the registers back to memory with an interleaving store
- Method 2
- Exploits the interleaved access between memory and registers, so two kinds of instructions achieve the transpose with no shuffling inside the registers
- vld1q_f32 loads the data into registers row by row
- vst4q_f32 interleaves the four registers while writing them to memory
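To make the interleaving behavior behind both methods concrete, here is a minimal portable sketch in plain C that models how vld1q_f32, vld4q_f32, and vst4q_f32 map a row-major 4x4 float matrix between memory and four registers. The type model_f32x4x4 and the model_* function names are hypothetical and exist only for illustration; they are scalar stand-ins, not NEON intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar stand-in for float32x4x4_t: val[register][lane] (hypothetical type). */
typedef struct { float val[4][4]; } model_f32x4x4;

/* Models four vld1q_f32 calls: register r receives row r unchanged. */
model_f32x4x4 model_load_rows(const float *src)
{
    model_f32x4x4 regs;
    assert(src != NULL);
    for (size_t r = 0; r < 4; r++)
        for (size_t lane = 0; lane < 4; lane++)
            regs.val[r][lane] = src[4 * r + lane];
    return regs;
}

/* Models vld4q_f32: de-interleaving load, register r receives every
   fourth element starting at offset r, i.e. column r of the matrix. */
model_f32x4x4 model_load_deinterleaved(const float *src)
{
    model_f32x4x4 regs;
    assert(src != NULL);
    for (size_t r = 0; r < 4; r++)
        for (size_t lane = 0; lane < 4; lane++)
            regs.val[r][lane] = src[4 * lane + r];
    return regs;
}

/* Models vst4q_f32: interleaving store, lane k of the four registers
   is written to four consecutive memory locations starting at 4 * k. */
void model_store_interleaved(float *dst, model_f32x4x4 regs)
{
    assert(dst != NULL);
    for (size_t lane = 0; lane < 4; lane++)
        for (size_t r = 0; r < 4; r++)
            dst[4 * lane + r] = regs.val[r][lane];
}
```

In this model, model_store_interleaved(mt, model_load_rows(m)) corresponds to Method 2: the row-wise load followed by the interleaving store leaves mt[4 * c + r] equal to m[4 * r + c], which is the transpose.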
Sample code
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initialize the source matrix M and the destination matrix MT
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM] = {0};

    // Goal:
    // a1 b1 c1 d1 => a1 a2 a3 a4
    // a2 b2 c2 d2 => b1 b2 b3 b4
    // ...
    // a4 b4 c4 d4 => d1 d2 d3 d4

    // Original scalar version
    int32_t i, j;
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            MT[j][i] = M[i][j];
        }
    }
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.; // reset for the next version
        }
        printf("\n");
    }

    // Method 1
    // vld4q_f32 de-interleaves with stride 4, so val[k] holds column k of M:
    // val[0]: a1 a2 a3 a4, val[1]: b1 b2 b3 b4, ...
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);
    // val[0]: a1 b1 a3 b3, val[1]: a2 b2 a4 b4
    float32x4x2_t vf32x4x2fTmpABCD01 = vtrnq_f32(vf32x4x4fTmpABCD.val[0], vf32x4x4fTmpABCD.val[1]);
    // val[0]: c1 d1 c3 d3, val[1]: c2 d2 c4 d4
    float32x4x2_t vf32x4x2fTmpABCD23 = vtrnq_f32(vf32x4x4fTmpABCD.val[2], vf32x4x4fTmpABCD.val[3]);
    // Rows 1 and 3: val[0]: a1 a3 c1 c3, val[1]: b1 b3 d1 d3
    float32x4x2_t vf32x4x2fTmpABCD02 = vuzpq_f32(vf32x4x2fTmpABCD01.val[0], vf32x4x2fTmpABCD23.val[0]);
    // Rows 2 and 4: val[0]: a2 a4 c2 c4, val[1]: b2 b4 d2 d4
    float32x4x2_t vf32x4x2fTmpABCD13 = vuzpq_f32(vf32x4x2fTmpABCD01.val[1], vf32x4x2fTmpABCD23.val[1]);
    // val[0]: a1 b1 c1 d1, val[1]: a3 b3 c3 d3
    vf32x4x2fTmpABCD02 = vtrnq_f32(vf32x4x2fTmpABCD02.val[0], vf32x4x2fTmpABCD02.val[1]);
    // val[0]: a2 b2 c2 d2, val[1]: a4 b4 c4 d4
    vf32x4x2fTmpABCD13 = vtrnq_f32(vf32x4x2fTmpABCD13.val[0], vf32x4x2fTmpABCD13.val[1]);
    vf32x4x4fTmpABCD.val[0] = vf32x4x2fTmpABCD02.val[0]; // a1 b1 c1 d1
    vf32x4x4fTmpABCD.val[2] = vf32x4x2fTmpABCD02.val[1]; // a3 b3 c3 d3
    vf32x4x4fTmpABCD.val[1] = vf32x4x2fTmpABCD13.val[0]; // a2 b2 c2 d2
    vf32x4x4fTmpABCD.val[3] = vf32x4x2fTmpABCD13.val[1]; // a4 b4 c4 d4
    // vst4q_f32 interleaves on store, so MT receives a1 a2 a3 a4, b1 b2 b3 b4, ...
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.; // reset for the next version
        }
        printf("\n");
    }

    // Method 2
    float32x4x4_t vf32x4x4fTmp1ABCD;
    vf32x4x4fTmp1ABCD.val[0] = vld1q_f32(&M[0][0]); // a1 b1 c1 d1
    vf32x4x4fTmp1ABCD.val[1] = vld1q_f32(&M[1][0]); // a2 b2 c2 d2
    vf32x4x4fTmp1ABCD.val[2] = vld1q_f32(&M[2][0]); // a3 b3 c3 d3
    vf32x4x4fTmp1ABCD.val[3] = vld1q_f32(&M[3][0]); // a4 b4 c4 d4
    // The interleaving store writes the transposed matrix into MT
    vst4q_f32(&MT[0][0], vf32x4x4fTmp1ABCD);
    printf("ver3:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }

    // Demo: keep the transposed result in registers instead of an output array
    // float fTmpABCD4x4[4][4]; // temporary staging array
    // vst4q_f32(&fTmpABCD4x4[0][0], vf32x4x4fTmpABCD); // assume vf32x4x4fTmpABCD holds the data to transpose
    // vf32x4x4fTmpABCD.val[0] = vld1q_f32(&fTmpABCD4x4[0][0]);
    // vf32x4x4fTmpABCD.val[1] = vld1q_f32(&fTmpABCD4x4[1][0]);
    // vf32x4x4fTmpABCD.val[2] = vld1q_f32(&fTmpABCD4x4[2][0]);
    // vf32x4x4fTmpABCD.val[3] = vld1q_f32(&fTmpABCD4x4[3][0]); // the transposed result is back in registers
    return 0;
}
The end of the code above includes a commented-out demo for the case where the transposed result is needed in registers rather than in an output array; it can be used as the scenario requires.
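The register shuffles of Method 1 can be traced lane by lane with a scalar model. The sketch below is plain C, so it runs without NEON hardware. It models the lane mapping of vtrnq_f32 and vuzpq_f32 and applies them in the same order as the sample code (two trn, two uzp, two trn, then four assignments). The types model_f32x4 and model_f32x4x2 and all model_* function names are hypothetical, introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar stand-ins for float32x4_t and float32x4x2_t (hypothetical types). */
typedef struct { float lane[4]; } model_f32x4;
typedef struct { model_f32x4 val[2]; } model_f32x4x2;

/* Models vtrnq_f32(a, b): val[0] = {a0, b0, a2, b2}, val[1] = {a1, b1, a3, b3}. */
model_f32x4x2 model_trn(model_f32x4 a, model_f32x4 b)
{
    model_f32x4x2 out;
    for (size_t k = 0; k < 4; k += 2) {
        out.val[0].lane[k]     = a.lane[k];
        out.val[0].lane[k + 1] = b.lane[k];
        out.val[1].lane[k]     = a.lane[k + 1];
        out.val[1].lane[k + 1] = b.lane[k + 1];
    }
    return out;
}

/* Models vuzpq_f32(a, b): val[0] = {a0, a2, b0, b2}, val[1] = {a1, a3, b1, b3}. */
model_f32x4x2 model_uzp(model_f32x4 a, model_f32x4 b)
{
    model_f32x4x2 out;
    for (size_t k = 0; k < 2; k++) {
        out.val[0].lane[k]     = a.lane[2 * k];
        out.val[0].lane[k + 2] = b.lane[2 * k];
        out.val[1].lane[k]     = a.lane[2 * k + 1];
        out.val[1].lane[k + 2] = b.lane[2 * k + 1];
    }
    return out;
}

/* Applies the trn/uzp/trn sequence of Method 1 to four modeled registers.
   On return, r[i].lane[j] holds what r[j].lane[i] held on entry. */
void model_transpose_in_registers(model_f32x4 r[4])
{
    assert(r != NULL);
    model_f32x4x2 t01 = model_trn(r[0], r[1]);
    model_f32x4x2 t23 = model_trn(r[2], r[3]);
    model_f32x4x2 u02 = model_uzp(t01.val[0], t23.val[0]);
    model_f32x4x2 u13 = model_uzp(t01.val[1], t23.val[1]);
    u02 = model_trn(u02.val[0], u02.val[1]);
    u13 = model_trn(u13.val[0], u13.val[1]);
    r[0] = u02.val[0];
    r[2] = u02.val[1];
    r[1] = u13.val[0];
    r[3] = u13.val[1];
}
```

In this model the six shuffles form a complete 4x4 transpose of the register contents: four registers that hold rows come out holding columns, and four registers that hold columns come out holding rows.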
Summary
Method 1
- Transposes in registers, then writes the result to memory
- The in-register part takes 6 shuffle instructions plus 4 assignments
Method 2
- Achieves the transpose directly through the interleaved access between memory and registers, using only two kinds of instructions (vld1q_f32 and vst4q_f32)
- The interleaved access between registers and memory takes 5 instructions: four loads and one store
In short, practice shows that when the data stays in registers, Method 1 is the better choice: it is more efficient than a round trip between memory and registers, even though it costs one or two more instructions. When the result has to be written to memory, Method 2 is the better choice.
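For the case where the result must land in memory, Method 2 can be wrapped in a small reusable helper. The following is a minimal sketch, not the author's code: the function name transpose4x4_f32 is hypothetical, and the scalar fallback for builds without NEON is an addition for portability that does not appear in the original.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Transposes the row-major 4x4 float matrix at src into dst.
   Assumption: src and dst are distinct, non-overlapping buffers of 16 floats. */
void transpose4x4_f32(const float *src, float *dst)
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Method 2: four row-wise loads, then one interleaving store. */
    float32x4x4_t rows;
    rows.val[0] = vld1q_f32(src + 0);
    rows.val[1] = vld1q_f32(src + 4);
    rows.val[2] = vld1q_f32(src + 8);
    rows.val[3] = vld1q_f32(src + 12);
    vst4q_f32(dst, rows);
#else
    /* Scalar fallback (not in the original) so the sketch builds anywhere. */
    for (size_t r = 0; r < 4; r++)
        for (size_t c = 0; c < 4; c++)
            dst[4 * c + r] = src[4 * r + c];
#endif
}
```

With the matrices from the sample code, a call would look like transpose4x4_f32(&M[0][0], &MT[0][0]).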