当前位置:网站首页>NEON优化:矩阵转置的指令优化案例
NEON优化:矩阵转置的指令优化案例
2022-07-06 17:22:00 【来知晓】
NEON优化系列文章:
背景
矩阵运算中经常用到转置操作,这里将原子矩阵4x4的转置NEON优化案例总结一下。
原始C函数负责将M[4][4]
转置为MT[4][4]
,效果如下:
M[4][4]
- a1, b1, c1, d1
- a2, b2, c2, d2
- a3, b3, c3, d3
- a4, b4, c4, d4
M[4][4]^T
- a1, a2, a3, a4
- b1, b2, b3, b4
- c1, c2, c3, c4
- d1, d2, d3, d4
优化思路
有两种并行计算的方法将其转置。
- 方法1
- 用到4条指令:vld4q_f32/vtrnq_f32/vuzpq_f32/vst4q_f32
- ld4q先从内存一批交叉读入到寄存器中
- trnq利用内部两两转置功能,将部分行列进行转置
- uzpq利用解交织的读写功能,将部分行列进行转置
- 然后利用寄存器单独赋值给不同行
- st4q将寄存器结果写入到内存中
- 方法2
- 利用内存和寄存器的交叉读写关系,两条指令实现转置,不用在寄存器中倒来倒去
- ld1q实现按行读取数据到寄存器
- st4q实现交叉读取寄存器值然后写入到内存中
样例代码
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>
#define ROW_NUM 4
#define COL_NUM 4
int main(void)
{
// initial
float M[ROW_NUM][COL_NUM] = {
{
0, 1, 2, 3},
{
4, 5, 6, 7},
{
8, 9, 10, 11},
{
12, 13, 14, 15},
};
float MT[ROW_NUM][COL_NUM] = {
0};
// to do this:
// a1 b1 c1 d1 => a1 a2 a3 a4
// a2 b2 c2 d2 => b1 b2 b3 b4
// ...
// a4 b4 c4 d4 => d1 d2 d3 d4
// origin
int32_t i, j;
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
MT[j][i] = M[i][j];
}
}
printf("ver1:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// method1
float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]); // vf32x4x4fTmpABCD中val[0]: a1 b1 c1 d1, val[1]: a2 b2 c2 d2
float32x4x2_t vf32x4x2fTmpABCD01 = vtrnq_f32(vf32x4x4fTmpABCD.val[0], vf32x4x4fTmpABCD.val[1]); // vf32x4x2fTmpABCD01中val[0]: a1 a2 c1 c2, val[1]: b1 b2 d1 d2
float32x4x2_t vf32x4x2fTmpABCD23 = vtrnq_f32(vf32x4x4fTmpABCD.val[2], vf32x4x4fTmpABCD.val[3]);
float32x4x2_t vf32x4x2fTmpABCD02 = vuzpq_f32(vf32x4x2fTmpABCD01.val[0], vf32x4x2fTmpABCD23.val[0]); // row02, 按行组合
float32x4x2_t vf32x4x2fTmpABCD13 = vuzpq_f32(vf32x4x2fTmpABCD01.val[1], vf32x4x2fTmpABCD23.val[1]); // row13, 按行组合
vf32x4x2fTmpABCD02 = vtrnq_f32(vf32x4x2fTmpABCD02.val[0], vf32x4x2fTmpABCD02.val[1]);
vf32x4x2fTmpABCD13 = vtrnq_f32(vf32x4x2fTmpABCD13.val[0], vf32x4x2fTmpABCD13.val[1]);
vf32x4x4fTmpABCD.val[0] = vf32x4x2fTmpABCD02.val[0]; // a0 a1 a2 a3
vf32x4x4fTmpABCD.val[2] = vf32x4x2fTmpABCD02.val[1];
vf32x4x4fTmpABCD.val[1] = vf32x4x2fTmpABCD13.val[0];
vf32x4x4fTmpABCD.val[3] = vf32x4x2fTmpABCD13.val[1]; // d0 d1 d2 d3
vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
printf("ver2:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// method2
float32x4x4_t vf32x4x4fTmp1ABCD;
vf32x4x4fTmp1ABCD.val[0] = vld1q_f32(&M[0][0]); // a1 b1 c1 d1
vf32x4x4fTmp1ABCD.val[1] = vld1q_f32(&M[1][0]);
vf32x4x4fTmp1ABCD.val[2] = vld1q_f32(&M[2][0]);
vf32x4x4fTmp1ABCD.val[3] = vld1q_f32(&M[3][0]); // a4 b4 c4 d4
vst4q_f32(&MT[0][0], vf32x4x4fTmp1ABCD); // 利用交叉读写特性,放入到MT数组
printf("ver3:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
// 仅在寄存器内转置不输出到内存
// float fTmpABCD4x4[4][4]; // 临时中转数组
// vst4q_f32(&fTmpABCD4x4[0][0], vf32x4x4fTmpABCD); // 假设待转置的数据为vf32x4x4fTmpABCD
// vf32x4x4fTmpABCD.val[0] = vld1q_f32(&fTmpABCD4x4[0][0]);
// vf32x4x4fTmpABCD.val[1] = vld1q_f32(&fTmpABCD4x4[1][0]);
// vf32x4x4fTmpABCD.val[2] = vld1q_f32(&fTmpABCD4x4[2][0]);
// vf32x4x4fTmpABCD.val[3] = vld1q_f32(&fTmpABCD4x4[3][0]); // 转置结果放到寄存器中
return 0;
}
以上代码末尾,附有仅在寄存器中实现转置的demo,可根据具体场景使用。
小结
方法1
在寄存器内转置后输出到内存
只在寄存器内操作,6条指令,加4个赋值
方法2
- 直接通过内存与寄存器的交叉读取实现转置功能,命令降为3条。
- 寄存器和内存之间交叉读取实现,5条指令
总之,实践得知,只在寄存器内操作场景,法1更优,比内存和寄存器之间读写交互更高效,哪怕多了一两条指令。而需要将结果输出到内存的时候,法2更优。
边栏推荐
- windows安装mysql8(5分钟)
- gnet: 一个轻量级且高性能的 Go 网络框架 使用笔记
- 腾讯云 WebShell 体验
- 通过串口实现printf函数,中断实现串口数据接收
- "Exquisite store manager" youth entrepreneurship incubation camp - the first phase of Shunde market has been successfully completed!
- Trace tool for MySQL further implementation plan
- C Primer Plus Chapter 14 (structure and other data forms)
- Niuke cold training camp 6B (Freund has no green name level)
- Levels - UE5中的暴雨效果
- STM32开发资料链接分享
猜你喜欢
.class文件的字节码结构
做微服务研发工程师的一年来的总结
ARM裸板调试之JTAG原理
pytorch之数据类型tensor
[batch dos-cmd command - summary and summary] - jump, cycle, condition commands (goto, errorlevel, if, for [read, segment, extract string]), CMD command error summary, CMD error
Summary of being a microservice R & D Engineer in the past year
Installation and testing of pyflink
Part 7: STM32 serial communication programming
[software reverse - solve flag] memory acquisition, inverse transformation operation, linear transformation, constraint solving
【批处理DOS-CMD命令-汇总和小结】-字符串搜索、查找、筛选命令(find、findstr),Find和findstr的区别和辨析
随机推荐
Explain in detail the matrix normalization function normalize() of OpenCV [norm or value range of the scoped matrix (normalization)], and attach norm_ Example code in the case of minmax
Chenglian premium products has completed the first step to enter the international capital market by taking shares in halber international
Slam d'attention: un slam visuel monoculaire appris de l'attention humaine
boot - prometheus-push gateway 使用
Part VI, STM32 pulse width modulation (PWM) programming
Trace tool for MySQL further implementation plan
UI control telerik UI for WinForms new theme - vs2022 heuristic theme
【批处理DOS-CMD命令-汇总和小结】-查看或修改文件属性(ATTRIB),查看、修改文件关联类型(assoc、ftype)
Set (generic & list & Set & custom sort)
gnet: 一个轻量级且高性能的 Go 网络框架 使用笔记
[Batch dos - cmd Command - Summary and Summary] - String search, find, Filter Commands (FIND, findstr), differentiation and Analysis of Find and findstr
Come on, don't spread it out. Fashion cloud secretly takes you to collect "cloud" wool, and then secretly builds a personal website to be the king of scrolls, hehe
斗地主游戏的案例开发
[user defined type] structure, union, enumeration
[Niuke classic question 01] bit operation
动态规划思想《从入门到放弃》
资产安全问题或制约加密行业发展 风控+合规成为平台破局关键
Do you understand this patch of the interface control devaxpress WinForms skin editor?
Make a simple graphical interface with Tkinter
STM32开发资料链接分享