当前位置:网站首页>Neon Optimization: an instruction optimization case of matrix transpose
Neon Optimization: an instruction optimization case of matrix transpose
2022-07-07 01:14:00 【To know】
NEON Optimize : Matrix transpose instruction optimization case
NEON Optimization series :
- NEON Optimize 1: Software performance optimization 、 How to reduce power consumption ?link
- NEON Optimize 2:ARM Summary of optimized high frequency instructions , link
- NEON Optimize 3: Matrix transpose instruction optimization case ,link
- NEON Optimize 4:floor/ceil Optimization case of function ,link
- NEON Optimize 5:log10 Optimization case of function ,link
- NEON Optimize 6: About cross access and reverse cross access ,link
- NEON Optimize 7: Performance optimization experience summary ,link
- NEON Optimize 8: Performance optimization FAQs QA,link
background
Transpose operation is often used in matrix operation , Here, the atomic matrix 4x4 The transpose NEON Summary of optimization cases .
original C The function is responsible for M[4][4]
Transpose to MT[4][4]
, The effect is as follows :
M[4][4]
- a1, b1, c1, d1
- a2, b2, c2, d2
- a3, b3, c3, d3
- a4, b4, c4, d4
M[4][4]^T
- a1, a2, a3, a4
- b1, b2, b3, b4
- c1, c2, c3, c4
- d1, d2, d3, d4
Optimization idea
There are two parallel computing methods to transpose it .
- Method 1
- be used 4 Orders :vld4q_f32/vtrnq_f32/vuzpq_f32/vst4q_f32
- ld4q First, a batch of cross reads from memory into registers
- trnq Use the internal binary transpose function , Transpose some rows and columns
- uzpq Use the deinterleaving read-write function , Transpose some rows and columns
- Then use registers to assign values to different rows
- st4q Write the register result to memory
- Method 2
- Use the cross read-write relationship between memory and registers , Two instructions realize transpose , Don't flip around in the register
- ld1q Realize reading data to register by line
- st4q Cross read register values and write them to memory
Sample code
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>
#define ROW_NUM 4
#define COL_NUM 4
int main(void)
{
// initial
float M[ROW_NUM][COL_NUM] = {
{
0, 1, 2, 3},
{
4, 5, 6, 7},
{
8, 9, 10, 11},
{
12, 13, 14, 15},
};
float MT[ROW_NUM][COL_NUM] = {
0};
// to do this:
// a1 b1 c1 d1 => a1 a2 a3 a4
// a2 b2 c2 d2 => b1 b2 b3 b4
// ...
// a4 b4 c4 d4 => d1 d2 d3 d4
// origin
int32_t i, j;
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
MT[j][i] = M[i][j];
}
}
printf("ver1:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// method1
float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]); // vf32x4x4fTmpABCD in val[0]: a1 b1 c1 d1, val[1]: a2 b2 c2 d2
float32x4x2_t vf32x4x2fTmpABCD01 = vtrnq_f32(vf32x4x4fTmpABCD.val[0], vf32x4x4fTmpABCD.val[1]); // vf32x4x2fTmpABCD01 in val[0]: a1 a2 c1 c2, val[1]: b1 b2 d1 d2
float32x4x2_t vf32x4x2fTmpABCD23 = vtrnq_f32(vf32x4x4fTmpABCD.val[2], vf32x4x4fTmpABCD.val[3]);
float32x4x2_t vf32x4x2fTmpABCD02 = vuzpq_f32(vf32x4x2fTmpABCD01.val[0], vf32x4x2fTmpABCD23.val[0]); // row02, Group by line
float32x4x2_t vf32x4x2fTmpABCD13 = vuzpq_f32(vf32x4x2fTmpABCD01.val[1], vf32x4x2fTmpABCD23.val[1]); // row13, Group by line
vf32x4x2fTmpABCD02 = vtrnq_f32(vf32x4x2fTmpABCD02.val[0], vf32x4x2fTmpABCD02.val[1]);
vf32x4x2fTmpABCD13 = vtrnq_f32(vf32x4x2fTmpABCD13.val[0], vf32x4x2fTmpABCD13.val[1]);
vf32x4x4fTmpABCD.val[0] = vf32x4x2fTmpABCD02.val[0]; // a0 a1 a2 a3
vf32x4x4fTmpABCD.val[2] = vf32x4x2fTmpABCD02.val[1];
vf32x4x4fTmpABCD.val[1] = vf32x4x2fTmpABCD13.val[0];
vf32x4x4fTmpABCD.val[3] = vf32x4x2fTmpABCD13.val[1]; // d0 d1 d2 d3
vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
printf("ver2:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// method2
float32x4x4_t vf32x4x4fTmp1ABCD;
vf32x4x4fTmp1ABCD.val[0] = vld1q_f32(&M[0][0]); // a1 b1 c1 d1
vf32x4x4fTmp1ABCD.val[1] = vld1q_f32(&M[1][0]);
vf32x4x4fTmp1ABCD.val[2] = vld1q_f32(&M[2][0]);
vf32x4x4fTmp1ABCD.val[3] = vld1q_f32(&M[3][0]); // a4 b4 c4 d4
vst4q_f32(&MT[0][0], vf32x4x4fTmp1ABCD); // Take advantage of the cross read and write feature , Put in MT Array
printf("ver3:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// Only transpose in register and do not output to memory
// float fTmpABCD4x4[4][4]; // Temporary transit array
// vst4q_f32(&fTmpABCD4x4[0][0], vf32x4x4fTmpABCD); // Suppose the data to be transposed is vf32x4x4fTmpABCD
// vf32x4x4fTmpABCD.val[0] = vld1q_f32(&fTmpABCD4x4[0][0]);
// vf32x4x4fTmpABCD.val[1] = vld1q_f32(&fTmpABCD4x4[1][0]);
// vf32x4x4fTmpABCD.val[2] = vld1q_f32(&fTmpABCD4x4[2][0]);
// vf32x4x4fTmpABCD.val[3] = vld1q_f32(&fTmpABCD4x4[3][0]); // Put the transpose result into the register
return 0;
}
At the end of the above code , With... That transposes only in registers demo, It can be used according to specific scenarios .
Summary
Method 1
Transpose in the register and output to the memory
Only operate in registers ,6 Orders , Add 4 Assignments
Method 2
- Directly realize the transpose function through the cross reading of memory and register , Command down to 3 strip .
- Cross reading between registers and memory ,5 Orders
All in all , Practice knows , Only operate the scene in the register , Law 1 better , It is more efficient than the read-write interaction between memory and registers , Even if there are oneortwo more instructions . When you need to output the results to memory , Law 2 better .
边栏推荐
- Address information parsing in one line of code
- 接收用户输入,身高BMI体重指数检测小业务入门案例
- 一行代码实现地址信息解析
- [force buckle]41 Missing first positive number
- Eventbus source code analysis
- Tencent cloud webshell experience
- A brief history of deep learning (II)
- 详解OpenCV的矩阵规范化函数normalize()【范围化矩阵的范数或值范围(归一化处理)】,并附NORM_MINMAX情况下的示例代码
- Explain in detail the matrix normalization function normalize() of OpenCV [norm or value range of the scoped matrix (normalization)], and attach norm_ Example code in the case of minmax
- [batch dos-cmd command - summary and summary] - view or modify file attributes (attrib), view and modify file association types (Assoc, ftype)
猜你喜欢
Building a dream in the digital era, the Xi'an station of the city chain science and Technology Strategy Summit ended smoothly
pyflink的安装和测试
随时随地查看远程试验数据与记录——IPEhub2与IPEmotion APP
Installation and testing of pyflink
[batch dos-cmd command - summary and summary] - string search, search, and filter commands (find, findstr), and the difference and discrimination between find and findstr
【JVM调优实战100例】04——方法区调优实战(上)
做微服务研发工程师的一年来的总结
pytorch之数据类型tensor
Chenglian premium products has completed the first step to enter the international capital market by taking shares in halber international
让我们,从头到尾,通透网络I/O模型
随机推荐
「笔记」折半搜索(Meet in the Middle)
Mongodb client operation (mongorepository)
【批處理DOS-CMD命令-匯總和小結】-字符串搜索、查找、篩選命令(find、findstr),Find和findstr的區別和辨析
通过串口实现printf函数,中断实现串口数据接收
Maidong Internet won the bid of Beijing life insurance to boost customers' brand value
HMM 笔记
树莓派/arm设备上安装火狐Firefox浏览器
Informatics Orsay Ibn YBT 1172: find the factorial of n within 10000 | 1.6 14: find the factorial of n within 10000
Grc: personal information protection law, personal privacy, corporate risk compliance governance
In rails, when the resource creation operation fails and render: new is called, why must the URL be changed to the index URL of the resource?
Asset security issues or constraints on the development of the encryption industry, risk control + compliance has become the key to breaking the platform
[Niuke] [noip2015] jumping stone
随时随地查看远程试验数据与记录——IPEhub2与IPEmotion APP
第四篇,STM32中断控制编程
Make a simple graphical interface with Tkinter
[force buckle]41 Missing first positive number
[case sharing] basic function configuration of network loop detection
What kind of experience is it to realize real-time collaboration in jupyter
windows安装mysql8(5分钟)
The MySQL database in Alibaba cloud was attacked, and finally the data was found