NEON Optimization: Interleaved and Reverse-Interleaved Access
2022-07-06 17:22:00 【来知晓】
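Background
NEON optimization constantly moves data between memory and register variables. NEON's multi-structure load/store instructions (ld2/ld3/ld4, st2/st3/st4) access memory in an interleaved fashion, while a few other instructions perform what this article calls reverse-interleaved access.
What are interleaved access and reverse-interleaved access?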
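- Interleaved read/write: ld2q/3q/4q, st2q/3q/4q, zip
  - Description: elements are distributed to the registers with a stride equal to the digit in 2q/3q/4q.
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld2q reads it into 2 registers as val[0]: a1 a2, val[1]: b1 b2.
- Reverse-interleaved read/write: ld1q/st1q, uzp
  - Description: elements are placed into the register consecutively, in memory order.
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld1q reads it into 1 register as val[0]: a1 b1 a2 b2.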
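The two load styles can be compared side by side. The following is a minimal sketch rather than the author's code: the helper names `load_split` and `load_plain` are hypothetical, and the scalar branch is an assumed fallback so that the snippet also builds on targets without NEON. Because a q register holds four 32-bit floats, the ld2q case consumes eight values (a1 b1 a2 b2 a3 b3 a4 b4). Note that Arm's own documentation describes this kind of load as de-interleaving; the article's naming is kept here.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: reads 8 floats laid out as a1 b1 a2 b2 a3 b3 a4 b4
 * and separates them into a[4] = a1..a4 and b[4] = b1..b4 (ld2q behaviour). */
void load_split(const float *src, float *a, float *b)
{
    assert(src && a && b);
#if defined(__ARM_NEON)
    float32x4x2_t v = vld2q_f32(src);  /* val[0]: a1 a2 a3 a4, val[1]: b1 b2 b3 b4 */
    vst1q_f32(a, v.val[0]);
    vst1q_f32(b, v.val[1]);
#else
    for (size_t i = 0; i < 4; i++) {   /* scalar model of the same lane mapping */
        a[i] = src[2 * i];
        b[i] = src[2 * i + 1];
    }
#endif
}

/* Hypothetical helper: reads 4 floats in plain memory order (ld1q behaviour). */
void load_plain(const float *src, float *dst)
{
    assert(src && dst);
#if defined(__ARM_NEON)
    vst1q_f32(dst, vld1q_f32(src));    /* a1 b1 a2 b2, order unchanged */
#else
    for (size_t i = 0; i < 4; i++)
        dst[i] = src[i];
#endif
}
```

Calling `load_split` on a buffer of alternating a/b values yields one array per stream, whereas `load_plain` simply copies the first four values as they sit in memory.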
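Related functions
The functions fall into two groups: memory reads/writes and register-to-register operations.
- Memory/register interaction
  - ld1q/st1q
    - With only one "dimension", interleaved access behaves exactly like an ordinary read/write.
  - ld2q/st2q, and likewise 3q and 4q
    - Purpose: all are interleaved reads/writes, intended for data that carries separate channels or dimensions.
    - Description: read vertically, write horizontally (read by column, write by row).
    - Note: when ld4q/st4q are used as a pair, the original layout is restored. The load is equivalent to transposing the matrix into the registers, and the store transposes it again on the way back to memory.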
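- Register-to-register interaction
  - vzip: interleaved access
    - Intrinsic:
      int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
    - Meaning:
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 4 1 5, val[1]: 2 6 3 7
      - Description: read vertically, write horizontally; the read pattern is a W-shaped zigzag, the write pattern is a straight line.
      - In detail: read one element of a, then one element of b, giving 0 4 1 5 2 6 3 7; the first four values (0 4 1 5) fill val[0] from left to right, then the remaining four (2 6 3 7) fill val[1].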
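  - vuzp: reverse-interleaved access
    - Intrinsic:
      int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
    - Meaning: similar to de-interleaving audio channel data
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 2 4 6, val[1]: 1 3 5 7
      - Description: read horizontally, write vertically; the read pattern is a straight line, the write pattern is a W-shaped zigzag.
      - In detail: the values of a and b are read in order, giving 0 1 2 3 4 5 6 7, and are then written column by column, alternating between val[0] and val[1].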
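The lane mappings of vzipq and vuzpq described above can be written out explicitly. This is a sketch under stated assumptions: `zip4` and `uzp4` are hypothetical wrapper names, the arrays stand in for the registers, and the scalar branch is an assumed reference model used only when NEON is not available.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical wrapper around vzipq_s32: out0/out1 receive val[0]/val[1]. */
void zip4(const int32_t *a, const int32_t *b, int32_t *out0, int32_t *out1)
{
    assert(a && b && out0 && out1);
#if defined(__ARM_NEON)
    int32x4x2_t r = vzipq_s32(vld1q_s32(a), vld1q_s32(b));
    vst1q_s32(out0, r.val[0]);
    vst1q_s32(out1, r.val[1]);
#else
    /* Scalar model: alternate a and b, then split the run of 8 in half. */
    int32_t tmp[8];
    for (int i = 0; i < 4; i++) {
        tmp[2 * i]     = a[i];
        tmp[2 * i + 1] = b[i];
    }
    for (int i = 0; i < 4; i++) {
        out0[i] = tmp[i];
        out1[i] = tmp[4 + i];
    }
#endif
}

/* Hypothetical wrapper around vuzpq_s32: out0/out1 receive val[0]/val[1]. */
void uzp4(const int32_t *a, const int32_t *b, int32_t *out0, int32_t *out1)
{
    assert(a && b && out0 && out1);
#if defined(__ARM_NEON)
    int32x4x2_t r = vuzpq_s32(vld1q_s32(a), vld1q_s32(b));
    vst1q_s32(out0, r.val[0]);
    vst1q_s32(out1, r.val[1]);
#else
    /* Scalar model: treat a then b as one run of 8; even positions go to
     * out0, odd positions go to out1. */
    int32_t tmp[8];
    for (int i = 0; i < 4; i++) {
        tmp[i]     = a[i];
        tmp[4 + i] = b[i];
    }
    for (int i = 0; i < 4; i++) {
        out0[i] = tmp[2 * i];
        out1[i] = tmp[2 * i + 1];
    }
#endif
}
```

With a = {0 1 2 3} and b = {4 5 6 7}, `zip4` produces {0 4 1 5} and {2 6 3 7}, and `uzp4` produces {0 2 4 6} and {1 3 5 7}. Feeding the two outputs of `zip4` into `uzp4` returns the original a and b, which is why the two operations are treated as opposites.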
Test code
ld4q/st4q functional test
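#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    int i, j;

    // Interleaved load: val[k] receives every 4th element, starting at element k
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);
    float MT[ROW_NUM][COL_NUM];

    // ver1: store each register with a plain vst1q, which yields the transpose of M
    vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
    vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]); // 1 5 9 13
    vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]); // 2 6 10 14
    vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.; // clear so that ver2 starts from a clean buffer
        }
        printf("\n");
    }

    // ver2: the paired vst4q interleaves again and restores the original layout
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}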
Output
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
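The ver2 result shows that a paired ld4q/st4q restores the memory layout, which is what makes the pair convenient for per-channel processing. The following is a minimal sketch of that idea, not code from the article: `scale_channel` is a hypothetical helper, the 4-channel frame layout is an assumption, and `frames` is assumed to be a multiple of 4 so that no tail loop is needed. The scalar branch is an assumed fallback for targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: data holds `frames` frames of 4 interleaved channels
 * (c0 c1 c2 c3 c0 c1 c2 c3 ...). Multiplies one channel by gain, in place. */
void scale_channel(float *data, size_t frames, int channel, float gain)
{
    assert(data && frames % 4 == 0 && channel >= 0 && channel < 4);
#if defined(__ARM_NEON)
    for (size_t f = 0; f < frames; f += 4) {
        float32x4x4_t v = vld4q_f32(data + 4 * f);  /* val[k] holds channel k */
        v.val[channel] = vmulq_n_f32(v.val[channel], gain);
        vst4q_f32(data + 4 * f, v);                 /* interleave back to memory */
    }
#else
    /* Scalar model: touch every 4th element, starting at the chosen channel. */
    for (size_t f = 0; f < frames; f++)
        data[4 * f + channel] *= gain;
#endif
}
```

Applied to the 16 values 0..15 with `channel = 2` and `gain = 2.0f`, only the elements at positions 2, 6, 10 and 14 change; every other element is written back untouched by the paired store.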
zip/uzp functional test
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // zip: read by column, write by row
    float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
    float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
    float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
    vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
    vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7

    // uzp: read by row, write by column
    float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
    float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
    float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
    vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
    vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
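Summary
With these comparisons in hand, interleaved and reverse-interleaved access can be pictured simply with a matrix: interleaved access reads the matrix out by column and writes it into the new variable by row, while reverse-interleaved access reads by row and writes by column. That is all there is to it.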
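The matrix picture can be taken one step further. As an illustration that goes beyond the article's own tests, the sketch below builds a full 4x4 transpose out of two rounds of vzipq, giving the same result as the ld4q load in ver1. The helper name `transpose4x4` and the flat 16-float layout are assumptions, the lane comments assume the input 0..15, and the scalar branch is an assumed fallback that performs a plain transpose when NEON is not available.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: src and dst each hold a 4x4 matrix as 16 floats,
 * row by row. Writes the transpose of src into dst (no overlap allowed). */
void transpose4x4(const float *src, float *dst)
{
    assert(src && dst && src != dst);
#if defined(__ARM_NEON)
    float32x4_t r0 = vld1q_f32(src);       /* 0 1 2 3     */
    float32x4_t r1 = vld1q_f32(src + 4);   /* 4 5 6 7     */
    float32x4_t r2 = vld1q_f32(src + 8);   /* 8 9 10 11   */
    float32x4_t r3 = vld1q_f32(src + 12);  /* 12 13 14 15 */

    /* Round 1: zip rows that are two apart. */
    float32x4x2_t t02 = vzipq_f32(r0, r2); /* 0 8 1 9   | 2 10 3 11 */
    float32x4x2_t t13 = vzipq_f32(r1, r3); /* 4 12 5 13 | 6 14 7 15 */

    /* Round 2: zip the matching halves to complete the columns. */
    float32x4x2_t lo = vzipq_f32(t02.val[0], t13.val[0]); /* 0 4 8 12  | 1 5 9 13  */
    float32x4x2_t hi = vzipq_f32(t02.val[1], t13.val[1]); /* 2 6 10 14 | 3 7 11 15 */

    vst1q_f32(dst,      lo.val[0]);
    vst1q_f32(dst + 4,  lo.val[1]);
    vst1q_f32(dst + 8,  hi.val[0]);
    vst1q_f32(dst + 12, hi.val[1]);
#else
    /* Scalar model: element (i, j) of src becomes element (j, i) of dst. */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            dst[4 * j + i] = src[4 * i + j];
#endif
}
```

For data already in memory, the ld4q load shown earlier is the shorter route; the zip-based version is mainly useful when the rows already sit in registers.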