当前位置：网站首页>Neon Optimization: an instruction optimization case of matrix transpose

Neon Optimization: an instruction optimization case of matrix transpose

2022-07-07 01:14:00 【To know】

NEON Optimize ： Matrix transpose instruction optimization case

NEON Optimization series ：
NEON Optimize 1： Software performance optimization 、 How to reduce power consumption ？link
NEON Optimize 2：ARM Summary of optimized high frequency instructions , link
NEON Optimize 3： Matrix transpose instruction optimization case ,link
NEON Optimize 4：floor/ceil Optimization case of function ,link
NEON Optimize 5：log10 Optimization case of function ,link
NEON Optimize 6： About cross access and reverse cross access ,link
NEON Optimize 7： Performance optimization experience summary ,link
NEON Optimize 8： Performance optimization FAQs QA,link

background

Transpose operation is often used in matrix operation , Here, the atomic matrix 4x4 The transpose NEON Summary of optimization cases .

original C The function is responsible for M[4][4] Transpose to MT[4][4], The effect is as follows ：

M[4][4]
- a1, b1, c1, d1
- a2, b2, c2, d2
- a3, b3, c3, d3
- a4, b4, c4, d4
M[4][4]^T
- a1, a2, a3, a4
- b1, b2, b3, b4
- c1, c2, c3, c4
- d1, d2, d3, d4

Optimization idea

There are two parallel computing methods to transpose it .

Method 1
- be used 4 Orders ：vld4q_f32/vtrnq_f32/vuzpq_f32/vst4q_f32
- ld4q First, a batch of cross reads from memory into registers
- trnq Use the internal binary transpose function , Transpose some rows and columns
- uzpq Use the deinterleaving read-write function , Transpose some rows and columns
- Then use registers to assign values to different rows
- st4q Write the register result to memory
Method 2
- Use the cross read-write relationship between memory and registers , Two instructions realize transpose , Don't flip around in the register
- ld1q Realize reading data to register by line
- st4q Cross read register values and write them to memory

Sample code

#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    
    // initial
    float M[ROW_NUM][COL_NUM] = {
    
        {
    0, 1, 2, 3},
        {
    4, 5, 6, 7},
        {
    8, 9, 10, 11},
        {
    12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM] = {
    0};
	
    // to do this:
    // a1 b1 c1 d1 => a1 a2 a3 a4
    // a2 b2 c2 d2 => b1 b2 b3 b4
    // ...
    // a4 b4 c4 d4 => d1 d2 d3 d4
 
    // origin
    int32_t i, j;
    for (i = 0; i < ROW_NUM; i++) {
    
        for (j = 0; j < COL_NUM; j++) {
    
            MT[j][i] = M[i][j];
        }
    }

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
    
        for (j = 0; j < COL_NUM; j++) {
    
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }

    // method1
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]); // vf32x4x4fTmpABCD in val[0]： a1 b1 c1 d1, val[1]: a2 b2 c2 d2
    float32x4x2_t vf32x4x2fTmpABCD01 = vtrnq_f32(vf32x4x4fTmpABCD.val[0], vf32x4x4fTmpABCD.val[1]); // vf32x4x2fTmpABCD01 in val[0]： a1 a2 c1 c2, val[1]: b1 b2 d1 d2
    float32x4x2_t vf32x4x2fTmpABCD23 = vtrnq_f32(vf32x4x4fTmpABCD.val[2], vf32x4x4fTmpABCD.val[3]);
    float32x4x2_t vf32x4x2fTmpABCD02 = vuzpq_f32(vf32x4x2fTmpABCD01.val[0], vf32x4x2fTmpABCD23.val[0]); // row02,  Group by line 
    float32x4x2_t vf32x4x2fTmpABCD13 = vuzpq_f32(vf32x4x2fTmpABCD01.val[1], vf32x4x2fTmpABCD23.val[1]); // row13,  Group by line 
    vf32x4x2fTmpABCD02 = vtrnq_f32(vf32x4x2fTmpABCD02.val[0], vf32x4x2fTmpABCD02.val[1]);
    vf32x4x2fTmpABCD13 = vtrnq_f32(vf32x4x2fTmpABCD13.val[0], vf32x4x2fTmpABCD13.val[1]);
    vf32x4x4fTmpABCD.val[0] = vf32x4x2fTmpABCD02.val[0]; // a0 a1 a2 a3
    vf32x4x4fTmpABCD.val[2] = vf32x4x2fTmpABCD02.val[1];
    vf32x4x4fTmpABCD.val[1] = vf32x4x2fTmpABCD13.val[0];
    vf32x4x4fTmpABCD.val[3] = vf32x4x2fTmpABCD13.val[1]; // d0 d1 d2 d3
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);

    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
    
        for (j = 0; j < COL_NUM; j++) {
    
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }


    // method2
    float32x4x4_t vf32x4x4fTmp1ABCD;
    vf32x4x4fTmp1ABCD.val[0] = vld1q_f32(&M[0][0]); // a1 b1 c1 d1
    vf32x4x4fTmp1ABCD.val[1] = vld1q_f32(&M[1][0]);
    vf32x4x4fTmp1ABCD.val[2] = vld1q_f32(&M[2][0]);
    vf32x4x4fTmp1ABCD.val[3] = vld1q_f32(&M[3][0]); // a4 b4 c4 d4
    vst4q_f32(&MT[0][0], vf32x4x4fTmp1ABCD); //  Take advantage of the cross read and write feature , Put in MT Array 

    printf("ver3:\n");
    for (i = 0; i < ROW_NUM; i++) {
    
        for (j = 0; j < COL_NUM; j++) {
    
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }

    //  Only transpose in register and do not output to memory 
    // float fTmpABCD4x4[4][4]; //  Temporary transit array 
    // vst4q_f32(&fTmpABCD4x4[0][0], vf32x4x4fTmpABCD); //  Suppose the data to be transposed is vf32x4x4fTmpABCD
    // vf32x4x4fTmpABCD.val[0] = vld1q_f32(&fTmpABCD4x4[0][0]);
    // vf32x4x4fTmpABCD.val[1] = vld1q_f32(&fTmpABCD4x4[1][0]);
    // vf32x4x4fTmpABCD.val[2] = vld1q_f32(&fTmpABCD4x4[2][0]);
    // vf32x4x4fTmpABCD.val[3] = vld1q_f32(&fTmpABCD4x4[3][0]); //  Put the transpose result into the register 
    
    return 0;
}

At the end of the above code , With... That transposes only in registers demo, It can be used according to specific scenarios .

Summary

Method 1

Transpose in the register and output to the memory
Only operate in registers ,6 Orders , Add 4 Assignments

Method 2

Directly realize the transpose function through the cross reading of memory and register , Command down to 3 strip .
Cross reading between registers and memory ,5 Orders

All in all , Practice knows , Only operate the scene in the register , Law 1 better , It is more efficient than the read-write interaction between memory and registers , Even if there are oneortwo more instructions . When you need to output the results to memory , Law 2 better .

原网站

版权声明
本文为[To know]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/188/202207061722311566.html