当前位置:网站首页>NEON优化:关于交叉存取与反向交叉存取
NEON优化:关于交叉存取与反向交叉存取
2022-07-06 17:22:00 【来知晓】
NEON优化系列文章:
背景
NEON优化过程中,经常遇到内存、存器变量间的读写,NEON内存读写指令中默认是交叉存取,部分特殊指令可以反向交叉存取。
什么是交叉存取,什么是反向交叉存取?
- 交叉读写:ld2q/3q/4q, st2q/3q/4q, zip
- 说明:按间隔(数目为2q/3q/4q中的数字)录入到相应寄存器
- 举例:如内存存储的连续数据为
a1 b1 a2 b2
,ld2q读出到2个寄存器为:val[0]: a1 a2, val[1]: b1 b2
- 反向交叉读写:ld1q/st1q, uzp
- 说明:按内存方向连续依次录入到寄存器
- 举例:如内存存储的连续数据为
a1 b1 a2 b2
,ld1q读出到1个寄存器为:val[0]: a1 b1 a2 b2
相关函数
主要从内存读写、寄存器间交互进行说明。
- 内存与寄存器交互
- ld1q/st1q
- 仅1维交叉读写与正常读写功能一致
- ld2q/st2q及3q、4q
- 作用:均为交叉读写,目的是为了处理不同声道、维度的信息
- 说明:纵向读数据,横向写数据(按列读,按行写)
- 注意:ld4q/st4q成对使用时,可还原回去,相当于对矩阵转置后放到寄存器中,再转置后放回到内存
- ld1q/st1q
- 寄存器间交互
- vzip交叉存取
- 指令:
int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
- 释义:
- 输入:a = {0 1 2 3},b = {4 5 6 7}
- 输出:val[0]:0 4 1 5,val[1]:2 6 3 7
- 说明:纵向读数据,横向写数据;读是W型,写是—型。
- 具体:先读a一个数据,再读b一个数据,成04152637,横向写完val0所在4个值后(0415),再写入val1剩余4个值(2637)
- 指令:
- uzpq反向交叉存取
- 指令:
int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
- 释义:类似于解交织声道数据
- 输入:a = {0 1 2 3},b = {4 5 6 7}
- 输出:val[0]:0 2 4 6,val[1]:1 3 5 7
- 说明:横向读数据,纵向写数据;读是—型,写是W型。
- 具体:a和b的值顺序读出,成01234567,将其val[0]/val[1]按列写入进去。
- 指令:
- vzip交叉存取
测试代码
ld4q/st4q功能测试
#define ROW_NUM 4
#define COL_NUM 4
// initial
float M[ROW_NUM][COL_NUM] = {
{
0, 1, 2, 3},
{
4, 5, 6, 7},
{
8, 9, 10, 11},
{
12, 13, 14, 15},
};
int i, j;
float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);
float MT[4][4];
vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]);
vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]);
vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
printf("ver1:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
printf("ver2:\n");
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
输出结果
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
zip/uzp功能测试
#define ROW_NUM 4
#define COL_NUM 4
// initial
float M[ROW_NUM][COL_NUM] = {
{
0, 1, 2, 3},
{
4, 5, 6, 7},
{
8, 9, 10, 11},
{
12, 13, 14, 15},
};
float MT[4][4];
// 按列读,按行写
float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7
// 按行读,按列写
float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15
printf("ver1:\n");
int i, j;
for (i = 0; i < ROW_NUM; i++) {
for (j = 0; j < COL_NUM; j++) {
printf("%f ", MT[i][j]);
MT[i][j] = 0.;
}
printf("\n");
}
小结
有了前面这些对比,所谓交叉存取、反向交叉存取,简单来看,想象一个矩阵,交叉存取就是按列读出来,按行写入到新变量,反向交叉存取则按行读出来,按列写进去,如此而已。
边栏推荐
- 动态规划思想《从入门到放弃》
- Zabbix 5.0:通过LLD方式自动化监控阿里云RDS
- Threejs image deformation enlarge full screen animation JS special effect
- [software reverse - solve flag] memory acquisition, inverse transformation operation, linear transformation, constraint solving
- Build your own website (17)
- [Niuke] [noip2015] jumping stone
- Segmenttree
- Chapter 5 DML data operation
- 【js】获取当前时间的前后n天或前后n个月(时分秒年月日都可)
- 【批处理DOS-CMD命令-汇总和小结】-字符串搜索、查找、筛选命令(find、findstr),Find和findstr的区别和辨析
猜你喜欢
Explain in detail the matrix normalization function normalize() of OpenCV [norm or value range of the scoped matrix (normalization)], and attach norm_ Example code in the case of minmax
Linear algebra of deep learning
随时随地查看远程试验数据与记录——IPEhub2与IPEmotion APP
[Niuke] b-complete square
Summary of being a microservice R & D Engineer in the past year
Dell Notebook Periodic Flash Screen Fault
第七篇,STM32串口通信编程
Deeply explore the compilation and pile insertion technology (IV. ASM exploration)
Learn to use code to generate beautiful interface documents!!!
Configuring OSPF basic functions for Huawei devices
随机推荐
Activereportsjs 3.1 Chinese version | | | activereportsjs 3.1 English version
Deep learning framework TF installation
. Bytecode structure of class file
腾讯云 WebShell 体验
【批处理DOS-CMD命令-汇总和小结】-跳转、循环、条件命令(goto、errorlevel、if、for[读取、切分、提取字符串]、)cmd命令错误汇总,cmd错误
Link sharing of STM32 development materials
批量获取中国所有行政区域经边界纬度坐标(到县区级别)
Linear algebra of deep learning
mysql: error while loading shared libraries: libtinfo. so. 5: cannot open shared object file: No such
Come on, don't spread it out. Fashion cloud secretly takes you to collect "cloud" wool, and then secretly builds a personal website to be the king of scrolls, hehe
What is time
「笔记」折半搜索(Meet in the Middle)
界面控件DevExpress WinForms皮肤编辑器的这个补丁,你了解了吗?
Anfulai embedded weekly report no. 272: 2022.06.27--2022.07.03
筑梦数字时代,城链科技战略峰会西安站顺利落幕
[software reverse - solve flag] memory acquisition, inverse transformation operation, linear transformation, constraint solving
pyflink的安装和测试
Chapter II proxy and cookies of urllib Library
Dynamic planning idea "from getting started to giving up"
mysql: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: No such