NEON Optimization: Interleaved and Reverse Interleaved Access
2022-07-07 01:14:00 【To know】
NEON Optimization series:
- NEON Optimization 1: How to optimize software performance and reduce power consumption, link
- NEON Optimization 2: Summary of frequently used ARM optimization instructions, link
- NEON Optimization 3: Instruction optimization case study: matrix transpose, link
- NEON Optimization 4: Optimization case study: the floor/ceil functions, link
- NEON Optimization 5: Optimization case study: the log10 function, link
- NEON Optimization 6: Interleaved and reverse interleaved access, link
- NEON Optimization 7: Summary of performance optimization experience, link
- NEON Optimization 8: Performance optimization FAQ, link
Background
NEON optimization constantly involves moving data between memory and registers, and between one register and another. NEON's multi-register load/store intrinsics (vld2q/vld3q/vld4q and vst2q/vst3q/vst4q) access memory in an interleaved pattern, and a few other instructions work in the reverse direction.
What is interleaved access, and what is reverse interleaved access?
- Interleaved reads and writes: vld2q/vld3q/vld4q, vst2q/vst3q/vst4q, vzip
  - Explanation: elements are distributed to the corresponding registers at a fixed stride (the 2, 3, or 4 in the intrinsic name)
  - Example: if memory holds the consecutive data a1 b1 a2 b2, vld2q reads it into 2 registers: val[0]: a1 a2, val[1]: b1 b2
- Reverse interleaved reads and writes: vld1q/vst1q, vuzp
  - Explanation: elements enter the register one after another, in memory order
  - Example: if memory holds the consecutive data a1 b1 a2 b2, vld1q reads it into 1 register: a1 b1 a2 b2
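The two load patterns described above can be modelled in plain C, which makes them easy to try without NEON hardware. This is only a scalar sketch, not the intrinsics themselves: the helper names `model_vld1` and `model_vld2` are hypothetical, and each "register" is simply an array of `lanes` floats.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a vld1q-style load (hypothetical helper, not an
   intrinsic): elements enter the single register in memory order. */
static void model_vld1(const float *mem, float *reg, size_t lanes)
{
    assert(mem != NULL && reg != NULL);
    for (size_t i = 0; i < lanes; i++) {
        reg[i] = mem[i];
    }
}

/* Scalar model of a vld2q-style load (hypothetical helper): elements at
   even positions go to val0, elements at odd positions go to val1.
   Reads 2 * lanes elements from mem. */
static void model_vld2(const float *mem, float *val0, float *val1,
                       size_t lanes)
{
    assert(mem != NULL && val0 != NULL && val1 != NULL);
    for (size_t i = 0; i < lanes; i++) {
        val0[i] = mem[2 * i];
        val1[i] = mem[2 * i + 1];
    }
}
```

With memory holding a1 b1 a2 b2 and `lanes` set to 2, `model_vld2` leaves a1 a2 in `val0` and b1 b2 in `val1`, while `model_vld1` with `lanes` set to 4 keeps the memory order unchanged.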
Related intrinsics
This section covers two kinds of transfer: between memory and registers, and between one register and another.
- Memory-to-register transfers
  - vld1q/vst1q
    - With only one register there is nothing to interleave, so these behave like ordinary sequential reads and writes
  - vld2q/vst2q, and the 3q and 4q variants
    - Purpose: all of them read and write in interleaved fashion, which is useful for data that carries separate channels or dimensions
    - Explanation: data is read vertically and written horizontally (read by column, written by row)
    - Note: when vld4q and vst4q are used as a pair, the original layout is restored; the pair is equivalent to transposing the matrix into registers and then transposing it back into memory
- Register-to-register transfers
  - vzip: interleaved access
    - Intrinsic:
      int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
    - Meaning:
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 4 1 5, val[1]: 2 6 3 7
      - Explanation: data is read vertically and written horizontally; the read follows a zigzag (W-shaped) path, the write follows a straight line.
      - In detail: one element is read from a, then one from b, and so on, giving 0 4 1 5 2 6 3 7. The first 4 values (0 4 1 5) fill val[0], and the remaining 4 values (2 6 3 7) are written to val[1].
  - vuzp: reverse interleaved access
    - Intrinsic:
      int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
    - Meaning: comparable to de-interleaving channel data
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 2 4 6, val[1]: 1 3 5 7
      - Explanation: data is read horizontally and written vertically; the read follows a straight line, the write follows a zigzag (W-shaped) path.
      - In detail: the values of a and then b are read out in order, giving 0 1 2 3 4 5 6 7, and are written into val[0]/val[1] column by column.
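The input/output pairs listed for vzip and vuzp can be reproduced with a scalar model in plain C. The sketch below is not NEON code: `model_i32x4x2`, `model_vzipq_s32` and `model_vuzpq_s32` are hypothetical stand-ins for `int32x4x2_t`, `vzipq_s32` and `vuzpq_s32`, written only to make the element movement explicit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_LANES 4

/* Pair of 4-lane registers, mirroring the shape of int32x4x2_t. */
typedef struct {
    int32_t val[2][MODEL_LANES];
} model_i32x4x2;

/* Scalar model of vzipq_s32 (hypothetical helper): build the stream
   a[0], b[0], a[1], b[1], ... and fill val[0] first, then val[1]. */
static model_i32x4x2 model_vzipq_s32(const int32_t *a, const int32_t *b)
{
    model_i32x4x2 r;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < MODEL_LANES; i++) {
        size_t k = 2 * i; /* position of a[i] in the interleaved stream */
        r.val[k / MODEL_LANES][k % MODEL_LANES] = a[i];
        r.val[(k + 1) / MODEL_LANES][(k + 1) % MODEL_LANES] = b[i];
    }
    return r;
}

/* Scalar model of vuzpq_s32 (hypothetical helper): treat a followed by b
   as one stream; even positions go to val[0], odd positions to val[1]. */
static model_i32x4x2 model_vuzpq_s32(const int32_t *a, const int32_t *b)
{
    model_i32x4x2 r;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < MODEL_LANES; i++) {
        size_t even = 2 * i;
        size_t odd = 2 * i + 1;
        r.val[0][i] = even < MODEL_LANES ? a[even] : b[even - MODEL_LANES];
        r.val[1][i] = odd < MODEL_LANES ? a[odd] : b[odd - MODEL_LANES];
    }
    return r;
}
```

In this model, feeding the two halves of a zip result into the unzip helper returns the original a and b, which is one way to see why the two operations are described as reverses of each other.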
Test code
Functional test of vld4q/vst4q
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // vld4q reads with a stride of 4: val[k] receives column k of M
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);

    // ver1: store each register with vst1q, so MT becomes the transpose of M
    vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
    vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]); // 1 5 9 13
    vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]); // 2 6 10 14
    vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.f; // clear for the next test
        }
        printf("\n");
    }

    // ver2: vst4q writes the registers back interleaved, restoring M's layout
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.f;
        }
        printf("\n");
    }
    return 0;
}
Output results
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
Functional test of vzip/vuzp
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // vzip: read by column, write by row
    float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
    float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
    float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
    vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
    vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7

    // vuzp: read by row, write by column
    float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
    float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
    float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
    vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
    vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.f;
        }
        printf("\n");
    }
    return 0;
}
Summary
The comparisons above boil down to one simple picture: imagine a matrix. Interleaved access reads by column and writes into the new variable by row; reverse interleaved access reads by row and writes by column. That is all there is to it.
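The matrix picture can be checked with a scalar model of the vld4q/vst4q pair used in the first test. This is a sketch under stated assumptions, not NEON code: `model_vld4` and `model_vst4` are hypothetical helpers, and the four "registers" are the rows of a plain 4x4 array.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_N 4 /* 4 registers of 4 lanes, matching a 4x4 float matrix */

/* Scalar model of vld4q_f32 (hypothetical helper): register r receives
   every 4th element starting at offset r, which is column r of a
   row-major 4x4 matrix. */
static void model_vld4(const float *mem, float reg[MODEL_N][MODEL_N])
{
    assert(mem != NULL && reg != NULL);
    for (size_t r = 0; r < MODEL_N; r++) {
        for (size_t lane = 0; lane < MODEL_N; lane++) {
            reg[r][lane] = mem[lane * MODEL_N + r];
        }
    }
}

/* Scalar model of vst4q_f32 (hypothetical helper): the inverse mapping,
   writing lane `lane` of register r back to offset lane * 4 + r. */
static void model_vst4(float *mem, float reg[MODEL_N][MODEL_N])
{
    assert(mem != NULL && reg != NULL);
    for (size_t r = 0; r < MODEL_N; r++) {
        for (size_t lane = 0; lane < MODEL_N; lane++) {
            mem[lane * MODEL_N + r] = reg[r][lane];
        }
    }
}
```

Loading the values 0 to 15 with `model_vld4` puts 0 4 8 12 in the first register, the same first row that the ver1 output shows, and `model_vst4` writes the registers back in the original order, as in ver2.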