
NEON Optimization: Interleaved and De-interleaved Access

2022-07-07 01:14:00 To know


NEON optimization series:

  1. NEON Optimization 1: Software performance optimization and how to reduce power consumption
  2. NEON Optimization 2: Summary of frequently used ARM optimization instructions
  3. NEON Optimization 3: Matrix transpose instruction optimization case
  4. NEON Optimization 4: Optimization case for the floor/ceil functions
  5. NEON Optimization 5: Optimization case for the log10 function
  6. NEON Optimization 6: Interleaved and de-interleaved access
  7. NEON Optimization 7: Summary of performance optimization experience
  8. NEON Optimization 8: Performance optimization FAQ

Background


NEON optimization frequently involves moving data between memory and registers, and between registers. NEON's structure load/store instructions (ld2/ld3/ld4, st2/st3/st4) interleave data as they transfer it, while some other instructions perform the reverse, de-interleaved arrangement.

What is interleaved access, and what is de-interleaved access?

  • Interleaved reads and writes: ld2q/3q/4q, st2q/3q/4q, zip
    • Explanation: elements are taken from memory at a fixed stride (the 2, 3, or 4 in the instruction name) and placed into the corresponding register
    • Example: if memory holds the consecutive data a1 b1 a2 b2, ld2q reads it into 2 registers: val[0]: a1 a2, val[1]: b1 b2
  • De-interleaved (reverse interleaved) reads and writes: ld1q/st1q, uzp
    • Explanation: elements are placed into the register one after another, in memory order
    • Example: if memory holds the consecutive data a1 b1 a2 b2, ld1q reads it into 1 register: a1 b1 a2 b2
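The two load patterns above can be modelled in plain C. The sketch below is a scalar reference model, not NEON code: the helper names `model_ld2`, `model_ld1`, and `check_load_models` are hypothetical, and the 2-lane register size only mirrors the a1 b1 a2 b2 example (real q registers hold more lanes).

```c
#include <assert.h>

#define LANES 2 /* lanes per register in the a1 b1 a2 b2 example (illustrative) */

/* Hypothetical scalar model of an ld2-style load: element i of memory
 * goes to register (i % 2), lane (i / 2). */
static void model_ld2(const int *mem, int val[2][LANES]) {
    for (int i = 0; i < 2 * LANES; i++) {
        val[i % 2][i / 2] = mem[i];
    }
}

/* Hypothetical scalar model of an ld1-style load: elements are copied
 * into the register in memory order. */
static void model_ld1(const int *mem, int reg[2 * LANES]) {
    for (int i = 0; i < 2 * LANES; i++) {
        reg[i] = mem[i];
    }
}

/* Checks the example from the text, encoding a1=10, b1=20, a2=11, b2=21. */
static void check_load_models(void) {
    const int mem[4] = {10, 20, 11, 21};
    int val[2][LANES];
    int reg[2 * LANES];

    model_ld2(mem, val);
    model_ld1(mem, reg);

    assert(val[0][0] == 10 && val[0][1] == 11); /* val[0]: a1 a2 */
    assert(val[1][0] == 20 && val[1][1] == 21); /* val[1]: b1 b2 */
    assert(reg[0] == 10 && reg[1] == 20 && reg[2] == 11 && reg[3] == 21);
}
```

Calling `check_load_models()` from a `main` function runs the checks. The only difference between the two models is the index mapping, which is the whole distinction between the two access patterns.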

Related functions


This section covers two kinds of transfer: reads and writes between memory and registers, and interaction between registers.

  • Memory-to-register interaction
    • ld1q/st1q
      • With only 1 dimension, interleaved reading and writing behaves the same as an ordinary contiguous read and write
    • ld2q/st2q, and likewise 3q and 4q
      • Purpose: all of these are interleaved reads and writes, intended for handling data from different channels or dimensions
      • Explanation: data is read vertically and written horizontally (read by column, write by row)
      • Note: when ld4q/st4q are used as a pair, the data is restored. This is equivalent to transposing the matrix as it is loaded into the registers, then transposing it back as it is stored to memory
  • Register-to-register interaction
    • vzip: interleaved access
      • Intrinsic: int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
      • Interpretation:
        • Input: a = {0 1 2 3}, b = {4 5 6 7}
        • Output: val[0]: 0 4 1 5, val[1]: 2 6 3 7
        • Explanation: data is read vertically and written horizontally; the read path is W-shaped (a zigzag between a and b), and the write path is a straight line.
        • In detail: read one element from a, then one from b, and so on, giving 0 4 1 5 2 6 3 7. The first 4 values (0 4 1 5) fill val[0], then the remaining 4 values (2 6 3 7) fill val[1]
    • vuzp: de-interleaved access
      • Intrinsic: int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
      • Interpretation: similar to de-interleaving channel data
        • Input: a = {0 1 2 3}, b = {4 5 6 7}
        • Output: val[0]: 0 2 4 6, val[1]: 1 3 5 7
        • Explanation: data is read horizontally and written vertically; the read path is a straight line, and the write path is W-shaped (a zigzag between val[0] and val[1]).
        • In detail: the values of a and b are read out sequentially, giving 0 1 2 3 4 5 6 7, and are then written into val[0]/val[1] column by column, alternating between the two registers.
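The zip and uzp descriptions above can be checked with a scalar model in plain C. This is a sketch, not NEON code: `pair4_t`, `model_zip`, `model_uzp`, and `check_zip_uzp` are hypothetical names standing in for `int32x4x2_t`, `vzipq_s32`, and `vuzpq_s32`, so that the example builds on any target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for int32x4x2_t: two 4-lane registers. */
typedef struct { int32_t val[2][4]; } pair4_t;

/* Scalar model of vzipq_s32: take one element from a, then one from b,
 * and write the resulting 8-element stream into val[0], then val[1]. */
static pair4_t model_zip(const int32_t a[4], const int32_t b[4]) {
    pair4_t r;
    for (int i = 0; i < 4; i++) {
        int k = 2 * i; /* position of a[i] in the interleaved stream */
        r.val[k / 4][k % 4] = a[i];
        r.val[(k + 1) / 4][(k + 1) % 4] = b[i];
    }
    return r;
}

/* Scalar model of vuzpq_s32: read a then b as one 8-element stream and
 * send even positions to val[0], odd positions to val[1]. */
static pair4_t model_uzp(const int32_t a[4], const int32_t b[4]) {
    pair4_t r;
    for (int k = 0; k < 8; k++) {
        int32_t x = (k < 4) ? a[k] : b[k - 4];
        r.val[k % 2][k / 2] = x;
    }
    return r;
}

/* Checks the a = {0 1 2 3}, b = {4 5 6 7} example from the text, and that
 * applying uzp to the zip result restores the original inputs. */
static void check_zip_uzp(void) {
    const int32_t a[4] = {0, 1, 2, 3}, b[4] = {4, 5, 6, 7};
    const int32_t zip_expect[2][4] = {{0, 4, 1, 5}, {2, 6, 3, 7}};
    const int32_t uzp_expect[2][4] = {{0, 2, 4, 6}, {1, 3, 5, 7}};
    pair4_t z = model_zip(a, b);
    pair4_t u = model_uzp(a, b);
    pair4_t back = model_uzp(z.val[0], z.val[1]);

    for (int i = 0; i < 4; i++) {
        assert(z.val[0][i] == zip_expect[0][i] && z.val[1][i] == zip_expect[1][i]);
        assert(u.val[0][i] == uzp_expect[0][i] && u.val[1][i] == uzp_expect[1][i]);
        assert(back.val[0][i] == a[i] && back.val[1][i] == b[i]);
    }
}
```

Calling `check_zip_uzp()` from a `main` function runs the checks. The last assertion illustrates that, in this model, uzp undoes zip, which matches the pairing of the two operations described above.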

Test code


ld4q/st4q functional test

#include <arm_neon.h>
#include <stdio.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void) {
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // vld4q reads with a stride of 4: val[k] receives column k of M
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);

    // ver1: store each register contiguously, so MT holds the transpose of M
    vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
    vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]); // 1 5 9 13
    vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]); // 2 6 10 14
    vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.; // clear for the next test
        }
        printf("\n");
    }

    // ver2: vst4q interleaves the four registers again, restoring M
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}

Output results

ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
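The ver1/ver2 results can be reproduced without NEON hardware using a scalar model of the two instructions. The sketch below is plain C under that assumption: `model_ld4`, `model_st4`, and `check_ld4_st4` are hypothetical names, and the index mapping is a model of what `vld4q_f32`/`vst4q_f32` do for 4-lane float registers rather than the intrinsics themselves.

```c
#include <assert.h>

/* Scalar model of vld4q_f32: memory element i goes to register (i % 4),
 * lane (i / 4), so register k receives column k of a row-major 4x4 matrix. */
static void model_ld4(const float *mem, float val[4][4]) {
    for (int i = 0; i < 16; i++) {
        val[i % 4][i / 4] = mem[i];
    }
}

/* Scalar model of vst4q_f32: the inverse mapping, which interleaves the
 * four registers back into memory order. */
static void model_st4(float *mem, float val[4][4]) {
    for (int i = 0; i < 16; i++) {
        mem[i] = val[i % 4][i / 4];
    }
}

/* Checks that the load yields the transpose (ver1) and that the paired
 * store restores the original matrix (ver2). */
static void check_ld4_st4(void) {
    float M[4][4], regs[4][4], out[4][4];

    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            M[r][c] = (float)(4 * r + c); /* 0..15, as in the test above */

    model_ld4(&M[0][0], regs);
    model_st4(&out[0][0], regs);

    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            assert(regs[r][c] == M[c][r]); /* ver1: transpose */
            assert(out[r][c] == M[r][c]);  /* ver2: restored */
        }
    }
}
```

Calling `check_ld4_st4()` from a `main` function runs the checks. Because the store uses exactly the inverse index mapping of the load, the round trip is the identity, which is why ver2 prints the original matrix.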

zip/uzp functional test

#include <arm_neon.h>
#include <stdio.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void) {
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        {0, 1, 2, 3},
        {4, 5, 6, 7},
        {8, 9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // zip: read by column, write by row
    float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
    float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
    float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
    vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
    vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7

    // uzp: read by row, write by column
    float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
    float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
    float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
    vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
    vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}

Summary


From the comparisons above, interleaved and de-interleaved access can be pictured simply in terms of a matrix: interleaved access reads by column and writes to the new variable by row, while de-interleaved access reads by row and writes by column. That is all there is to it.
