NEON Optimization: About cross access and reverse cross access
2022-07-07 01:14:00 【To know】
NEON optimization series:
- NEON Optimization 1: Software performance optimization and how to reduce power consumption
- NEON Optimization 2: Summary of frequently used ARM optimization instructions
- NEON Optimization 3: Matrix transpose instruction optimization case
- NEON Optimization 4: Optimization case for the floor/ceil functions
- NEON Optimization 5: Optimization case for the log10 function
- NEON Optimization 6: About cross access and reverse cross access
- NEON Optimization 7: Summary of performance optimization experience
- NEON Optimization 8: Performance optimization FAQ
Background
During NEON optimization, data frequently has to be moved between memory and register variables. NEON's multi-register load/store instructions use cross (interleaved) access, and a few other instructions perform the reverse cross access.
What is cross access, and what is reverse cross access?
- Cross read/write: ld2q/3q/4q, st2q/3q/4q, zip
  - Explanation: elements are distributed to the corresponding registers at a fixed interval (the number in 2q/3q/4q)
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld2q reads it into 2 registers: val[0]: a1 a2, val[1]: b1 b2
- Reverse cross read/write: ld1q/st1q, uzp
  - Explanation: elements enter the register one after another, in memory order
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld1q reads it into 1 register: val[0]: a1 b1 a2 b2
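The two load patterns above can be modeled in plain C. This is a minimal scalar sketch rather than NEON code: the helper names `load_contiguous` and `load_deinterleave2` are hypothetical, and the lane count is a parameter so that the two-lane example from the text can be reproduced directly. On ARM the matching intrinsics would be `vld1q_f32` and `vld2q_f32`, which work on 4 float lanes per q register. Note that Arm's own documentation calls the ld2 pattern a de-interleaving load.

```c
#include <stddef.h>

/* Scalar model of an ld1-style load: elements are copied in memory order. */
void load_contiguous(const float *mem, float *val, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        val[i] = mem[i];
    }
}

/* Scalar model of an ld2-style load: memory is walked with stride 2, so
 * even-indexed elements land in val0 and odd-indexed elements in val1. */
void load_deinterleave2(const float *mem, float *val0, float *val1, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        val0[i] = mem[2 * i];
        val1[i] = mem[2 * i + 1];
    }
}
```

With `mem = {a1, b1, a2, b2}` and `lanes = 2`, `load_deinterleave2` fills `val0` with `a1 a2` and `val1` with `b1 b2`, while `load_contiguous` with `lanes = 4` keeps the order `a1 b1 a2 b2`.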
Related functions
The functions are explained in two groups: reads and writes between memory and registers, and interaction between registers.
- Memory to register interaction
  - ld1q/st1q
    - With only 1 dimension, cross read/write behaves the same as a normal read/write
  - ld2q/st2q, and likewise 3q and 4q
    - Effect: all of them are cross reads and writes, intended for handling data from different channels or dimensions
    - Explanation: read data vertically, write data horizontally (read by column, write by row)
    - Note: when ld4q/st4q are used as a pair, the data is restored. This is equivalent to transposing the matrix while loading it into the registers, then transposing it back while storing it to memory
- Register to register interaction
  - vzip: cross access
    - Instruction:
      int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
    - Meaning:
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 4 1 5, val[1]: 2 6 3 7
      - Explanation: read data vertically, write data horizontally; the read path zigzags between a and b (a "W" shape), and the write path is a straight line.
      - Specifically: read one element from a, then one from b, and so on, which gives 0 4 1 5 2 6 3 7. The first 4 values (0 4 1 5) fill val[0], and the remaining 4 values (2 6 3 7) fill val[1].
  - vuzp: reverse cross access
    - Instruction:
      int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
    - Meaning: similar to de-interleaving channel data
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 2 4 6, val[1]: 1 3 5 7
      - Explanation: read data horizontally, write data vertically; the read path is a straight line, and the write path zigzags between val[0] and val[1] (a "W" shape).
      - Specifically: the values of a and b are read out in order, which gives 0 1 2 3 4 5 6 7. They are then written into val[0]/val[1] by column, so elements at even positions go to val[0] and elements at odd positions go to val[1].
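The lane movement of the two register-to-register operations can be checked with a scalar model. The sketch below is illustrative only: `pair_s32`, `zip_model`, and `uzp_model` are hypothetical names that stand in for `int32x4x2_t`, `vzipq_s32`, and `vuzpq_s32`, and the code describes where each lane ends up, not how the hardware does it.

```c
#include <stdint.h>

#define ZIP_LANES 4

/* Hypothetical stand-in for int32x4x2_t: two 4-lane registers. */
typedef struct {
    int32_t val[2][ZIP_LANES];
} pair_s32;

/* Scalar model of vzipq_s32: build the stream a0 b0 a1 b1 ... and
 * write it straight through val[0], then val[1]. */
pair_s32 zip_model(const int32_t a[ZIP_LANES], const int32_t b[ZIP_LANES])
{
    pair_s32 r;
    for (int k = 0; k < 2 * ZIP_LANES; k++) {
        int32_t s = (k % 2 == 0) ? a[k / 2] : b[k / 2];
        r.val[k / ZIP_LANES][k % ZIP_LANES] = s;
    }
    return r;
}

/* Scalar model of vuzpq_s32: read a then b straight through, and
 * alternate the destination between val[0] and val[1]. */
pair_s32 uzp_model(const int32_t a[ZIP_LANES], const int32_t b[ZIP_LANES])
{
    pair_s32 r;
    for (int k = 0; k < 2 * ZIP_LANES; k++) {
        int32_t s = (k < ZIP_LANES) ? a[k] : b[k - ZIP_LANES];
        r.val[k % 2][k / 2] = s;
    }
    return r;
}
```

For `a = {0 1 2 3}` and `b = {4 5 6 7}`, `zip_model` yields `0 4 1 5` and `2 6 3 7`, and `uzp_model` yields `0 2 4 6` and `1 3 5 7`. Feeding the two outputs of `zip_model` into `uzp_model` gives back `a` and `b`, which is why the two operations are described as reverses of each other.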
Test code
ld4q/st4q functional test
#include <arm_neon.h>
#include <stdio.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // vld4q reads with stride 4: val[k] receives column k of M
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);

    // ver1: store each register contiguously, so MT is the transpose of M
    vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
    vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]); // 1 5 9 13
    vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]); // 2 6 10 14
    vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.0f; // clear for the next test
        }
        printf("\n");
    }

    // ver2: vst4q writes with stride 4, which restores the original layout
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.0f;
        }
        printf("\n");
    }
    return 0;
}
Output results
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
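The behavior seen in ver1 and ver2 can be reproduced without NEON hardware. The following scalar sketch models the lane layout of `vld4q_f32` and `vst4q_f32`; the names `quad_f32`, `load_deinterleave4`, and `store_interleave4` are hypothetical and only serve the illustration.

```c
#include <stddef.h>

#define QUAD 4 /* 4 registers in the group, 4 float lanes per register */

/* Hypothetical stand-in for float32x4x4_t. */
typedef struct {
    float val[QUAD][QUAD];
} quad_f32;

/* Scalar model of vld4q_f32: lane i of register k comes from mem[4*i + k],
 * so each register ends up holding one column of a row-major 4x4 matrix. */
void load_deinterleave4(const float *mem, quad_f32 *regs)
{
    for (size_t i = 0; i < QUAD; i++) {
        for (size_t k = 0; k < QUAD; k++) {
            regs->val[k][i] = mem[QUAD * i + k];
        }
    }
}

/* Scalar model of vst4q_f32: the exact inverse mapping of the load. */
void store_interleave4(float *mem, const quad_f32 *regs)
{
    for (size_t i = 0; i < QUAD; i++) {
        for (size_t k = 0; k < QUAD; k++) {
            mem[QUAD * i + k] = regs->val[k][i];
        }
    }
}
```

Loading the values 0 to 15 gives `val[0] = 0 4 8 12` through `val[3] = 3 7 11 15`, matching ver1, and storing the same registers with `store_interleave4` writes 0 to 15 back in their original order, matching ver2.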
zip/uzp functional test
#include <arm_neon.h>
#include <stdio.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // zip: read by column, write by row
    float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
    float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
    float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
    vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
    vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7

    // uzp: read by row, write by column
    float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
    float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
    float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
    vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
    vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.0f;
        }
        printf("\n");
    }
    return 0;
}
Summary
From the comparisons above, cross access and reverse cross access are easy to picture as operations on a matrix: cross access reads by column and writes to the new variable by row, while reverse cross access reads by row and writes by column. That's all.
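The matrix picture can be written down directly. In this rough sketch, the hypothetical helpers `read_by_column` and `write_by_column` treat a flat array as a row-major `rows x cols` matrix; they are scalar illustrations of the idea, not replacements for the NEON instructions.

```c
#include <stddef.h>

/* Cross access: read a row-major rows x cols matrix column by column
 * and write the elements out consecutively (by row). */
void read_by_column(const int *src, size_t rows, size_t cols, int *dst)
{
    size_t n = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            dst[n++] = src[r * cols + c];
        }
    }
}

/* Reverse cross access: read the elements consecutively (by row) and
 * write them into a row-major rows x cols matrix column by column. */
void write_by_column(const int *src, size_t rows, size_t cols, int *dst)
{
    size_t n = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            dst[r * cols + c] = src[n++];
        }
    }
}
```

With `src = {0 1 2 3 4 5 6 7}` viewed as 2 rows of 4 (the registers a and b), `read_by_column` produces `0 4 1 5 2 6 3 7`, the zip result, and `write_by_column` produces `0 2 4 6 1 3 5 7`, the uzp result.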