当前位置:网站首页>NEON优化:矩阵转置的指令优化案例
NEON优化:矩阵转置的指令优化案例
2022-07-06 17:22:00 【来知晓】
NEON优化系列文章:
背景
矩阵运算中经常用到转置操作,这里将原子矩阵4x4的转置NEON优化案例总结一下。
原始C函数负责将M[4][4]
转置为MT[4][4]
,效果如下:
M[4][4]
- a1, b1, c1, d1
- a2, b2, c2, d2
- a3, b3, c3, d3
- a4, b4, c4, d4
M[4][4]^T
- a1, a2, a3, a4
- b1, b2, b3, b4
- c1, c2, c3, c4
- d1, d2, d3, d4
优化思路
有两种并行计算的方法将其转置。
- 方法1
- 用到4条指令:vld4q_f32/vtrnq_f32/vuzpq_f32/vst4q_f32
- ld4q先从内存一批交叉读入到寄存器中
- trnq利用内部两两转置功能,将部分行列进行转置
- uzpq利用解交织的读写功能,将部分行列进行转置
- 然后利用寄存器单独赋值给不同行
- st4q将寄存器结果写入到内存中
- 方法2
- 利用内存和寄存器的交叉读写关系,两条指令实现转置,不用在寄存器中倒来倒去
- ld1q实现按行读取数据到寄存器
- st4q实现交叉读取寄存器值然后写入到内存中
样例代码
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>
#define ROW_NUM 4
#define COL_NUM 4
int main(void)
{
// initial
float M[ROW_NUM][COL_NUM] = {
{
0, 1, 2, 3},
{
4, 5, 6, 7},
{
8, 9, 10, 11},
{
12, 13, 14, 15},
};
float MT[ROW_NUM][COL_NUM] = {
0};
// to do this:
// a1 b1 c1 d1 => a1 a2 a3 a4
// a2 b2 c2 d2 => b1 b2 b3 b4
// ...
// a4 b4 c4 d4 => d1 d2 d3 d4
// origin
int32_t i, j;
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
MT[j][i] = M[i][j];
}
}
printf("ver1:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// method1
float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]); // vf32x4x4fTmpABCD中val[0]: a1 b1 c1 d1, val[1]: a2 b2 c2 d2
float32x4x2_t vf32x4x2fTmpABCD01 = vtrnq_f32(vf32x4x4fTmpABCD.val[0], vf32x4x4fTmpABCD.val[1]); // vf32x4x2fTmpABCD01中val[0]: a1 a2 c1 c2, val[1]: b1 b2 d1 d2
float32x4x2_t vf32x4x2fTmpABCD23 = vtrnq_f32(vf32x4x4fTmpABCD.val[2], vf32x4x4fTmpABCD.val[3]);
float32x4x2_t vf32x4x2fTmpABCD02 = vuzpq_f32(vf32x4x2fTmpABCD01.val[0], vf32x4x2fTmpABCD23.val[0]); // row02, 按行组合
float32x4x2_t vf32x4x2fTmpABCD13 = vuzpq_f32(vf32x4x2fTmpABCD01.val[1], vf32x4x2fTmpABCD23.val[1]); // row13, 按行组合
vf32x4x2fTmpABCD02 = vtrnq_f32(vf32x4x2fTmpABCD02.val[0], vf32x4x2fTmpABCD02.val[1]);
vf32x4x2fTmpABCD13 = vtrnq_f32(vf32x4x2fTmpABCD13.val[0], vf32x4x2fTmpABCD13.val[1]);
vf32x4x4fTmpABCD.val[0] = vf32x4x2fTmpABCD02.val[0]; // a0 a1 a2 a3
vf32x4x4fTmpABCD.val[2] = vf32x4x2fTmpABCD02.val[1];
vf32x4x4fTmpABCD.val[1] = vf32x4x2fTmpABCD13.val[0];
vf32x4x4fTmpABCD.val[3] = vf32x4x2fTmpABCD13.val[1]; // d0 d1 d2 d3
vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
printf("ver2:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// method2
float32x4x4_t vf32x4x4fTmp1ABCD;
vf32x4x4fTmp1ABCD.val[0] = vld1q_f32(&M[0][0]); // a1 b1 c1 d1
vf32x4x4fTmp1ABCD.val[1] = vld1q_f32(&M[1][0]);
vf32x4x4fTmp1ABCD.val[2] = vld1q_f32(&M[2][0]);
vf32x4x4fTmp1ABCD.val[3] = vld1q_f32(&M[3][0]); // a4 b4 c4 d4
vst4q_f32(&MT[0][0], vf32x4x4fTmp1ABCD); // 利用交叉读写特性,放入到MT数组
printf("ver3:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// 仅在寄存器内转置不输出到内存
// float fTmpABCD4x4[4][4]; // 临时中转数组
// vst4q_f32(&fTmpABCD4x4[0][0], vf32x4x4fTmpABCD); // 假设待转置的数据为vf32x4x4fTmpABCD
// vf32x4x4fTmpABCD.val[0] = vld1q_f32(&fTmpABCD4x4[0][0]);
// vf32x4x4fTmpABCD.val[1] = vld1q_f32(&fTmpABCD4x4[1][0]);
// vf32x4x4fTmpABCD.val[2] = vld1q_f32(&fTmpABCD4x4[2][0]);
// vf32x4x4fTmpABCD.val[3] = vld1q_f32(&fTmpABCD4x4[3][0]); // 转置结果放到寄存器中
return 0;
}
以上代码末尾,附有仅在寄存器中实现转置的demo,可根据具体场景使用。
小结
方法1
在寄存器内转置后输出到内存
只在寄存器内操作,6条指令,加4个赋值
方法2
- 直接通过内存与寄存器的交叉读取实现转置功能,命令降为3条。
- 寄存器和内存之间交叉读取实现,5条指令
总之,实践得知,只在寄存器内操作场景,法1更优,比内存和寄存器之间读写交互更高效,哪怕多了一两条指令。而需要将结果输出到内存的时候,法2更优。
边栏推荐
- mysql: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: No such
- [software reverse - solve flag] memory acquisition, inverse transformation operation, linear transformation, constraint solving
- SuperSocket 1.6 创建一个简易的报文长度在头部的Socket服务器
- Deep learning environment configuration jupyter notebook
- Chapter 5 DML data operation
- 新手如何入门学习PostgreSQL?
- 重上吹麻滩——段芝堂创始人翟立冬游记
- Attention SLAM:一種從人類注意中學習的視覺單目SLAM
- Maidong Internet won the bid of Beijing life insurance to boost customers' brand value
- 【批处理DOS-CMD命令-汇总和小结】-查看或修改文件属性(ATTRIB),查看、修改文件关联类型(assoc、ftype)
猜你喜欢
城联优品入股浩柏国际进军国际资本市场,已完成第一步
【批處理DOS-CMD命令-匯總和小結】-字符串搜索、查找、篩選命令(find、findstr),Find和findstr的區別和辨析
Learn self 3D representation like ray tracing ego3rt
Provincial and urban level three coordinate boundary data CSV to JSON
资产安全问题或制约加密行业发展 风控+合规成为平台破局关键
《安富莱嵌入式周报》第272期:2022.06.27--2022.07.03
Explain in detail the matrix normalization function normalize() of OpenCV [norm or value range of the scoped matrix (normalization)], and attach norm_ Example code in the case of minmax
Configuring the stub area of OSPF for Huawei devices
[HFCTF2020]BabyUpload session解析引擎
Attention SLAM:一種從人類注意中學習的視覺單目SLAM
随机推荐
再聊聊我常用的15个数据源网站
Telerik UI 2022 R2 SP1 Retail-Not Crack
Chenglian premium products has completed the first step to enter the international capital market by taking shares in halber international
gnet: 一个轻量级且高性能的 Go 网络框架 使用笔记
Dell筆記本周期性閃屏故障
Deep understanding of distributed cache design
[batch dos-cmd command - summary and summary] - string search, search, and filter commands (find, findstr), and the difference and discrimination between find and findstr
Five different code similarity detection and the development trend of code similarity detection
What is time
Anfulai embedded weekly report no. 272: 2022.06.27--2022.07.03
pyflink的安装和测试
Openjudge noi 1.7 08: character substitution
Provincial and urban level three coordinate boundary data CSV to JSON
from . cv2 import * ImportError: libGL. so. 1: cannot open shared object file: No such file or direc
代码克隆的优缺点
MySQL中回表的代价
【批处理DOS-CMD命令-汇总和小结】-字符串搜索、查找、筛选命令(find、findstr),Find和findstr的区别和辨析
Dell笔记本周期性闪屏故障
[software reverse - solve flag] memory acquisition, inverse transformation operation, linear transformation, constraint solving
Part 7: STM32 serial communication programming