NEON Optimization: A Case Study in Instruction-Level Optimization of Matrix Transposition
2022-07-06 17:22:00 【来知晓】
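Background
Transposition comes up constantly in matrix computation. This post summarizes a NEON optimization case for transposing the basic building block, a 4x4 matrix.
The original C function transposes M[4][4] into MT[4][4], with the following effect:
M[4][4]:
- a1, b1, c1, d1
- a2, b2, c2, d2
- a3, b3, c3, d3
- a4, b4, c4, d4
M[4][4]^T:
- a1, a2, a3, a4
- b1, b2, b3, b4
- c1, c2, c3, c4
- d1, d2, d3, d4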
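Optimization approach
There are two ways to transpose the block with SIMD parallelism.
- Method 1
  - Uses four intrinsics: vld4q_f32/vtrnq_f32/vuzpq_f32/vst4q_f32
  - vld4q_f32 loads 16 floats from memory and de-interleaves them across four registers
  - vtrnq_f32 swaps lanes pairwise between two registers, which transposes part of the rows and columns
  - vuzpq_f32 de-interleaves two registers, which transposes another part of the rows and columns
  - The resulting registers are then assigned one by one to the matching rows
  - vst4q_f32 interleaves the four registers and writes the result to memory
- Method 2
  - Relies on the interleaved transfer between memory and registers, so two intrinsics perform the transpose without shuffling data back and forth inside registers
  - vld1q_f32 loads each row into a register
  - vst4q_f32 reads the registers in interleaved order and writes the values to memory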
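To trace what the two methods do to the data, the lane behavior of each intrinsic can be modeled in portable scalar C. The sketch below is not NEON code: the types `v4`, `v4x2`, `v4x4` and every `model_*` function are hypothetical stand-ins written for this illustration, and they only mimic the lane layout of vld1q_f32, vld4q_f32, vst4q_f32, vtrnq_f32 and vuzpq_f32.

```c
#include <string.h>

/* Hypothetical scalar stand-ins for float32x4_t, float32x4x2_t, float32x4x4_t. */
typedef struct { float lane[4]; } v4;
typedef struct { v4 val[2]; } v4x2;
typedef struct { v4 val[4]; } v4x4;

/* Models vld1q_f32: four consecutive floats go into one register. */
static v4 model_vld1q(const float *src)
{
    v4 r;
    memcpy(r.lane, src, sizeof r.lane);
    return r;
}

/* Models vld4q_f32: register k receives src[k], src[k+4], src[k+8], src[k+12]. */
static v4x4 model_vld4q(const float *src)
{
    v4x4 r;
    for (int k = 0; k < 4; k++)
        for (int i = 0; i < 4; i++)
            r.val[k].lane[i] = src[4 * i + k];
    return r;
}

/* Models vst4q_f32: lane i of register k is stored to dst[4*i + k]. */
static void model_vst4q(float *dst, v4x4 r)
{
    for (int k = 0; k < 4; k++)
        for (int i = 0; i < 4; i++)
            dst[4 * i + k] = r.val[k].lane[i];
}

/* Models vtrnq_f32: val[0] = {a0, b0, a2, b2}, val[1] = {a1, b1, a3, b3}. */
static v4x2 model_vtrnq(v4 a, v4 b)
{
    v4x2 r;
    for (int i = 0; i < 2; i++) {
        r.val[0].lane[2 * i]     = a.lane[2 * i];
        r.val[0].lane[2 * i + 1] = b.lane[2 * i];
        r.val[1].lane[2 * i]     = a.lane[2 * i + 1];
        r.val[1].lane[2 * i + 1] = b.lane[2 * i + 1];
    }
    return r;
}

/* Models vuzpq_f32: val[0] = {a0, a2, b0, b2}, val[1] = {a1, a3, b1, b3}. */
static v4x2 model_vuzpq(v4 a, v4 b)
{
    v4x2 r;
    for (int i = 0; i < 2; i++) {
        r.val[0].lane[i]     = a.lane[2 * i];
        r.val[0].lane[i + 2] = b.lane[2 * i];
        r.val[1].lane[i]     = a.lane[2 * i + 1];
        r.val[1].lane[i + 2] = b.lane[2 * i + 1];
    }
    return r;
}

/* Method 1: de-interleaving load, register shuffles, interleaving store. */
void model_method1(const float *src, float *dst)
{
    v4x4 m = model_vld4q(src);                 /* val[k] = column k of src */
    v4x2 t01 = model_vtrnq(m.val[0], m.val[1]);
    v4x2 t23 = model_vtrnq(m.val[2], m.val[3]);
    v4x2 u02 = model_vuzpq(t01.val[0], t23.val[0]);
    v4x2 u13 = model_vuzpq(t01.val[1], t23.val[1]);
    u02 = model_vtrnq(u02.val[0], u02.val[1]);
    u13 = model_vtrnq(u13.val[0], u13.val[1]);
    m.val[0] = u02.val[0];                     /* val[k] = row k of src */
    m.val[2] = u02.val[1];
    m.val[1] = u13.val[0];
    m.val[3] = u13.val[1];
    model_vst4q(dst, m);                       /* the interleaving store transposes */
}

/* Method 2: plain row loads followed by one interleaving store. */
void model_method2(const float *src, float *dst)
{
    v4x4 m;
    for (int k = 0; k < 4; k++)
        m.val[k] = model_vld1q(src + 4 * k);
    model_vst4q(dst, m);
}
```

Calling either function on a row-major 4x4 array fills `dst` with its transpose. Stepping through `model_method1` in a debugger shows which lanes each shuffle moves.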
Sample code
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // initial
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM] = {0};

    // to do this:
    // a1 b1 c1 d1 => a1 a2 a3 a4
    // a2 b2 c2 d2 => b1 b2 b3 b4
    // ...
    // a4 b4 c4 d4 => d1 d2 d3 d4

    // origin
    int32_t i, j;
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            MT[j][i] = M[i][j];
        }
    }
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }

    // method1
    // vld4q_f32 de-interleaves with stride 4, so val[k] holds column k of M:
    // val[0]: a1 a2 a3 a4, val[1]: b1 b2 b3 b4, val[2]: c1 c2 c3 c4, val[3]: d1 d2 d3 d4
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);
    // val[0]: a1 b1 a3 b3, val[1]: a2 b2 a4 b4
    float32x4x2_t vf32x4x2fTmpABCD01 = vtrnq_f32(vf32x4x4fTmpABCD.val[0], vf32x4x4fTmpABCD.val[1]);
    // val[0]: c1 d1 c3 d3, val[1]: c2 d2 c4 d4
    float32x4x2_t vf32x4x2fTmpABCD23 = vtrnq_f32(vf32x4x4fTmpABCD.val[2], vf32x4x4fTmpABCD.val[3]);
    // rows 0 and 2: val[0]: a1 a3 c1 c3, val[1]: b1 b3 d1 d3
    float32x4x2_t vf32x4x2fTmpABCD02 = vuzpq_f32(vf32x4x2fTmpABCD01.val[0], vf32x4x2fTmpABCD23.val[0]);
    // rows 1 and 3: val[0]: a2 a4 c2 c4, val[1]: b2 b4 d2 d4
    float32x4x2_t vf32x4x2fTmpABCD13 = vuzpq_f32(vf32x4x2fTmpABCD01.val[1], vf32x4x2fTmpABCD23.val[1]);
    // val[0]: a1 b1 c1 d1, val[1]: a3 b3 c3 d3
    vf32x4x2fTmpABCD02 = vtrnq_f32(vf32x4x2fTmpABCD02.val[0], vf32x4x2fTmpABCD02.val[1]);
    // val[0]: a2 b2 c2 d2, val[1]: a4 b4 c4 d4
    vf32x4x2fTmpABCD13 = vtrnq_f32(vf32x4x2fTmpABCD13.val[0], vf32x4x2fTmpABCD13.val[1]);
    vf32x4x4fTmpABCD.val[0] = vf32x4x2fTmpABCD02.val[0]; // a1 b1 c1 d1
    vf32x4x4fTmpABCD.val[2] = vf32x4x2fTmpABCD02.val[1]; // a3 b3 c3 d3
    vf32x4x4fTmpABCD.val[1] = vf32x4x2fTmpABCD13.val[0]; // a2 b2 c2 d2
    vf32x4x4fTmpABCD.val[3] = vf32x4x2fTmpABCD13.val[1]; // a4 b4 c4 d4
    // vst4q_f32 interleaves on store: lane i of val[k] goes to MT[i][k], which yields the transpose
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }

    // method2
    float32x4x4_t vf32x4x4fTmp1ABCD;
    vf32x4x4fTmp1ABCD.val[0] = vld1q_f32(&M[0][0]); // a1 b1 c1 d1
    vf32x4x4fTmp1ABCD.val[1] = vld1q_f32(&M[1][0]);
    vf32x4x4fTmp1ABCD.val[2] = vld1q_f32(&M[2][0]);
    vf32x4x4fTmp1ABCD.val[3] = vld1q_f32(&M[3][0]); // a4 b4 c4 d4
    vst4q_f32(&MT[0][0], vf32x4x4fTmp1ABCD); // the interleaving store writes the transposed result into MT
    printf("ver3:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }

    // Transpose register contents only, without keeping the result in memory
    // float fTmpABCD4x4[4][4]; // temporary staging array
    // vst4q_f32(&fTmpABCD4x4[0][0], vf32x4x4fTmpABCD); // assume the data to transpose is in vf32x4x4fTmpABCD
    // vf32x4x4fTmpABCD.val[0] = vld1q_f32(&fTmpABCD4x4[0][0]);
    // vf32x4x4fTmpABCD.val[1] = vld1q_f32(&fTmpABCD4x4[1][0]);
    // vf32x4x4fTmpABCD.val[2] = vld1q_f32(&fTmpABCD4x4[2][0]);
    // vf32x4x4fTmpABCD.val[3] = vld1q_f32(&fTmpABCD4x4[3][0]); // the transposed result ends up in the registers
    return 0;
}
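The end of the code above includes a commented-out demo for the case where the transposed result must stay in registers; it round-trips through a temporary array. Use it as the scenario requires.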
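For comparison, a register-only variant can reuse the vtrnq_f32/vuzpq_f32/vtrnq_f32 sequence from method 1 and skip the temporary array entirely. The sketch below is one possible arrangement, not the author's code: `transpose4x4_regs` and `transpose4x4` are hypothetical names, and the scalar fallback is there only so the snippet also builds on targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: transposes the 4x4 block held in m.val[0..3]
 * using register shuffles only (same sequence as method 1). */
static float32x4x4_t transpose4x4_regs(float32x4x4_t m)
{
    float32x4x2_t t01 = vtrnq_f32(m.val[0], m.val[1]);
    float32x4x2_t t23 = vtrnq_f32(m.val[2], m.val[3]);
    float32x4x2_t u02 = vuzpq_f32(t01.val[0], t23.val[0]);
    float32x4x2_t u13 = vuzpq_f32(t01.val[1], t23.val[1]);
    u02 = vtrnq_f32(u02.val[0], u02.val[1]);
    u13 = vtrnq_f32(u13.val[0], u13.val[1]);
    m.val[0] = u02.val[0];
    m.val[1] = u13.val[0];
    m.val[2] = u02.val[1];
    m.val[3] = u13.val[1];
    return m;
}
#endif

/* Hypothetical wrapper: writes the transpose of the row-major 4x4
 * matrix src into dst. */
void transpose4x4(const float *src, float *dst)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4x4_t m;
    m.val[0] = vld1q_f32(src);       /* row 0 */
    m.val[1] = vld1q_f32(src + 4);   /* row 1 */
    m.val[2] = vld1q_f32(src + 8);   /* row 2 */
    m.val[3] = vld1q_f32(src + 12);  /* row 3 */
    m = transpose4x4_regs(m);        /* no memory traffic in this step */
    vst1q_f32(dst, m.val[0]);
    vst1q_f32(dst + 4, m.val[1]);
    vst1q_f32(dst + 8, m.val[2]);
    vst1q_f32(dst + 12, m.val[3]);
#else
    /* Scalar fallback for non-NEON targets. */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[4 * c + r] = src[4 * r + c];
#endif
}
```

In a real kernel the loads and stores in `transpose4x4` would normally be replaced by whatever computation already keeps the block in registers; only `transpose4x4_regs` is the part being illustrated.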
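Summary
Method 1
- Transposes in registers, then writes the result to memory.
- The register-only part takes 6 instructions (2 vtrnq_f32, 2 vuzpq_f32, 2 vtrnq_f32) plus 4 assignments.
Method 2
- Implements the transpose directly through the interleaved transfer between memory and registers, with no register shuffling.
- The memory/register transfer takes 5 instructions: 4 vld1q_f32 plus 1 vst4q_f32.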
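In short, practice shows that when the data only needs to be transposed inside registers, method 1 is the better choice: it is more efficient than a round trip between memory and registers, even though it costs an instruction or two more. When the result has to be written to memory, method 2 is the better choice.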