NEON Optimization: Interleaved and Reverse-Interleaved Access
2022-07-06 17:22:00 【来知晓】
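Background
NEON optimization work constantly moves data between memory and register variables. The NEON load/store instructions that target several registers (ld2q/3q/4q, st2q/3q/4q) use interleaved access, and certain other instructions perform reverse-interleaved access.
What is interleaved access, and what is reverse-interleaved access?
- Interleaved read/write: ld2q/3q/4q, st2q/3q/4q, zip
  - Description: elements are dealt out to the registers with a stride equal to the digit in 2q/3q/4q
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld2q reads it into 2 registers as val[0]: a1 a2, val[1]: b1 b2
- Reverse-interleaved read/write: ld1q/st1q, uzp
  - Description: elements are placed into the register consecutively, in memory order
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld1q reads it into 1 register as val[0]: a1 b1 a2 b2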
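The stride rule in the two examples above can be written as a small scalar model. This is only a sketch in plain C, not NEON code: `model_ldn` and `model_stn` are hypothetical helper names, and the "registers" are an ordinary array laid out as `n_regs` rows of `lanes` floats. Note that Arm's own documentation describes the ldN direction as a de-interleaving load and the stN direction as an interleaving store; this article uses "interleaved access" for the pair.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of an ldN-style load (hypothetical helper, not a NEON call).
 * Memory element i lands in register (i % n_regs), lane (i / n_regs).
 * regs is laid out as n_regs rows of `lanes` floats. */
static void model_ldn(const float *mem, size_t n_regs, size_t lanes, float *regs)
{
    assert(n_regs >= 1 && n_regs <= 4);
    for (size_t i = 0; i < n_regs * lanes; i++) {
        regs[(i % n_regs) * lanes + (i / n_regs)] = mem[i];
    }
}

/* Scalar model of the matching stN-style store: the inverse mapping,
 * so model_stn after model_ldn reproduces the original memory. */
static void model_stn(float *mem, size_t n_regs, size_t lanes, const float *regs)
{
    assert(n_regs >= 1 && n_regs <= 4);
    for (size_t i = 0; i < n_regs * lanes; i++) {
        mem[i] = regs[(i % n_regs) * lanes + (i / n_regs)];
    }
}
```

With `n_regs = 2` and `lanes = 2`, the memory a1 b1 a2 b2 becomes the rows a1 a2 and b1 b2, as in the ld2q example; with `n_regs = 1` the mapping is the identity, as in the ld1q example. The q-form intrinsics on 32-bit elements correspond to `lanes = 4`.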
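Related functions
The functions fall into two groups: transfers between memory and registers, and operations between registers.
- Memory-to-register transfers
  - ld1q/st1q
    - With a single register there is nothing to interleave, so these behave exactly like ordinary contiguous loads and stores
  - ld2q/st2q, and likewise 3q and 4q
    - Purpose: all are interleaved accesses, intended for data that mixes several channels or dimensions
    - Description: read down the columns, write along the rows
    - Note: used as a pair, ld4q/st4q restore the original layout. It is equivalent to transposing the matrix on the way into the registers, then transposing it again on the way back to memory
- Register-to-register operations
  - vzip: interleave
    - Intrinsic: int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
    - Meaning:
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 4 1 5, val[1]: 2 6 3 7
      - Description: read down the columns, write along the rows; the read traces a W-shaped path, the write a straight line.
      - In detail: read one element from a, then one from b, giving 0 4 1 5 2 6 3 7. The first 4 values (0 4 1 5) fill val[0] left to right, then the remaining 4 values (2 6 3 7) fill val[1]
  - vuzp: reverse interleave
    - Intrinsic: int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
    - Meaning: similar to de-interleaving audio channel data
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 2 4 6, val[1]: 1 3 5 7
      - Description: read along the rows, write down the columns; the read traces a straight line, the write a W-shaped path.
      - In detail: the values of a and then b are read in order, giving 0 1 2 3 4 5 6 7, and are written alternately into val[0] and val[1], column by column.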
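The lane movement of the two register operations can be checked without NEON hardware using a scalar model. The sketch below is plain C under stated assumptions: `model_zip` and `model_uzp` are hypothetical names, and a flat array of 8 ints stands in for the two-register result, with `out[0..3]` playing val[0] and `out[4..7]` playing val[1].

```c
#include <assert.h>

#define LANES 4

/* Scalar model of vzipq_s32 (hypothetical helper): take one lane from a,
 * then one from b, and write the stream out in order. */
static void model_zip(const int a[LANES], const int b[LANES], int out[2 * LANES])
{
    assert(out != a && out != b);
    for (int i = 0; i < LANES; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Scalar model of vuzpq_s32 (hypothetical helper): walk the concatenation
 * a|b in order; even positions go to val[0], odd positions to val[1]. */
static void model_uzp(const int a[LANES], const int b[LANES], int out[2 * LANES])
{
    assert(out != a && out != b);
    for (int i = 0; i < 2 * LANES; i++) {
        int v = (i < LANES) ? a[i] : b[i - LANES];
        out[(i % 2) * LANES + i / 2] = v;
    }
}
```

Feeding the two halves of a `model_zip` result back into `model_uzp` returns the original a and b, which is one way to see that the two operations are inverses of each other.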
Test code
ld4q/st4q functional test
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // initial
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // stride-4 load: val[k] receives column k of M
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);

    // ver1: store each register contiguously, so MT is the transpose of M
    vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
    vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]); // 1 5 9 13
    vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]); // 2 6 10 14
    vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.; // reset for the next test
        }
        printf("\n");
    }

    // ver2: the paired stride-4 store restores the original layout of M
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
Output
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
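The ver1 result is exactly the transpose of M, which suggests packaging the load/store pair as a small helper. The following is a minimal sketch, not the author's code: `transpose4x4` is a hypothetical name, and the scalar branch is an assumption added so the function also builds on targets where `__ARM_NEON` is not defined.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Transpose a row-major 4x4 float matrix: dst[c][r] = src[r][c].
 * src and dst must be distinct 16-float buffers. */
static void transpose4x4(const float *src, float *dst)
{
    assert(src != dst);
#if defined(__ARM_NEON)
    /* Stride-4 load: cols.val[k] receives column k of src. */
    float32x4x4_t cols = vld4q_f32(src);
    /* Contiguous stores: column k of src becomes row k of dst. */
    vst1q_f32(dst + 0,  cols.val[0]);
    vst1q_f32(dst + 4,  cols.val[1]);
    vst1q_f32(dst + 8,  cols.val[2]);
    vst1q_f32(dst + 12, cols.val[3]);
#else
    /* Portable fallback with the same result. */
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            dst[c * 4 + r] = src[r * 4 + c];
        }
    }
#endif
}
```

Calling `transpose4x4(&M[0][0], &MT[0][0])` on the matrix from the test should reproduce the ver1 rows shown above.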
zip/uzp functional test
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // initial
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // zip: read by columns, write by rows
    float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
    float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
    float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
    vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
    vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7

    // uzp: read by rows, write by columns
    float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
    float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
    float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
    vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
    vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
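Summary
With these comparisons in hand, interleaved and reverse-interleaved access come down to a simple picture. Imagine a matrix: interleaved access reads it out by columns and writes it into the new variable by rows, while reverse-interleaved access reads by rows and writes by columns. That is all there is to it.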
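One place this picture pays off is the multi-channel case mentioned earlier. The sketch below applies a separate gain to the left and right channels of an interleaved stereo buffer. It is an illustration under assumptions, not code from the article: `stereo_gain` and its parameters are hypothetical, the buffer layout is assumed to be L R L R ..., and the scalar loop doubles as the tail handler and as the fallback when `__ARM_NEON` is not defined.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scale the channels of an interleaved stereo buffer (L R L R ...) in place.
 * frames is the number of L/R pairs, so lr holds 2 * frames floats. */
static void stereo_gain(float *lr, size_t frames, float gain_l, float gain_r)
{
    size_t i = 0;
    assert(lr != NULL || frames == 0);
#if defined(__ARM_NEON)
    /* Process 4 frames (8 floats) per iteration. */
    for (; i + 4 <= frames; i += 4) {
        /* Stride-2 load: val[0] holds 4 left samples, val[1] 4 right samples. */
        float32x4x2_t ch = vld2q_f32(lr + 2 * i);
        ch.val[0] = vmulq_n_f32(ch.val[0], gain_l);
        ch.val[1] = vmulq_n_f32(ch.val[1], gain_r);
        /* Stride-2 store puts the samples back in L R L R order. */
        vst2q_f32(lr + 2 * i, ch);
    }
#endif
    /* Scalar tail, or the whole buffer when NEON is unavailable. */
    for (; i < frames; i++) {
        lr[2 * i]     *= gain_l;
        lr[2 * i + 1] *= gain_r;
    }
}
```

The paired load and store keep the per-channel arithmetic free of any explicit shuffling, which is the main reason to reach for ld2q/st2q here instead of ld1q followed by uzp.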