NEON Optimization: Interleaved Access and Reverse Interleaved Access
2022-07-07 01:14:00 【To know】
NEON optimization series:
- NEON Optimization 1: Software performance optimization and how to reduce power consumption
- NEON Optimization 2: Summary of frequently used ARM optimization instructions
- NEON Optimization 3: Matrix transpose instruction optimization case
- NEON Optimization 4: Optimization case for the floor/ceil functions
- NEON Optimization 5: Optimization case for the log10 function
- NEON Optimization 6: Interleaved access and reverse interleaved access
- NEON Optimization 7: Summary of performance optimization experience
- NEON Optimization 8: Performance optimization FAQ
Background
NEON optimization often involves moving data between memory and registers, and between registers. NEON's multi-register load/store instructions (ld2q/3q/4q, st2q/3q/4q) access memory in an interleaved fashion, and some other instructions perform the reverse operation.
What is interleaved access, and what is reverse interleaved access?
- Interleaved read/write: ld2q/3q/4q, st2q/3q/4q, zip
  - Explanation: elements are distributed to the corresponding registers at a fixed stride (the 2, 3, or 4 in the instruction name). ARM's documentation describes the ldN loads as de-interleaving and the stN stores as interleaving.
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld2q loads it into 2 registers as val[0]: a1 a2, val[1]: b1 b2 (shown with two lanes per register for brevity).
- Reverse interleaved read/write: ld1q/st1q, uzp
  - Explanation: elements are placed into the register one after another, in memory order.
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld1q loads it into 1 register as val[0]: a1 b1 a2 b2.
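The two load styles can be modelled in plain C to make the element order explicit. The sketch below is a scalar reference model, not NEON code: the helper names `model_ld1`, `model_ld2`, and `check_load_models` are hypothetical, and it uses two lanes per register to match the four-element example above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of a sequential load (ld1-style):
 * lane i of the single register receives mem[i]. */
void model_ld1(const int *mem, int *reg, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        reg[i] = mem[i];
}

/* Hypothetical scalar model of a 2-way interleaved load (ld2-style):
 * even-indexed elements go to val0, odd-indexed elements go to val1. */
void model_ld2(const int *mem, int *val0, int *val1, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        val0[i] = mem[2 * i];
        val1[i] = mem[2 * i + 1];
    }
}

/* Self-check with stand-in values a1=1, b1=10, a2=2, b2=20. */
void check_load_models(void)
{
    const int mem[4] = {1, 10, 2, 20};   /* a1 b1 a2 b2 */
    int val0[2], val1[2], reg[4];

    model_ld2(mem, val0, val1, 2);
    assert(val0[0] == 1 && val0[1] == 2);     /* a1 a2 */
    assert(val1[0] == 10 && val1[1] == 20);   /* b1 b2 */

    model_ld1(mem, reg, 4);
    assert(reg[0] == 1 && reg[1] == 10 && reg[2] == 2 && reg[3] == 20);
}
```

A model like this can also serve as a reference when checking the result of the real intrinsics on a target device.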
Related functions
This section covers two kinds of interaction: reads and writes between memory and registers, and operations between registers.
- Memory-register interaction
  - ld1q/st1q
    - With only one register there is nothing to interleave, so these behave the same as ordinary sequential loads and stores.
  - ld2q/st2q, and likewise 3q and 4q
    - Purpose: all of these perform interleaved reads and writes, which is useful for handling data from different channels or dimensions.
    - Explanation: data is read vertically and written horizontally (read by column, write by row).
    - Note: when ld4q/st4q are used as a pair, the data is restored. The pair is equivalent to transposing the matrix into registers and then transposing it back into memory.
- Register-register interaction
  - vzip: interleaved access
    - Intrinsic: int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
    - Meaning:
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 4 1 5, val[1]: 2 6 3 7
      - Explanation: data is read vertically and written horizontally. The read pattern is W-shaped (zigzagging between a and b), and the write pattern is a straight line.
      - In detail: one element is read from a, then one from b, and so on, giving 0 4 1 5 2 6 3 7. The first 4 values (0 4 1 5) fill val[0], and the remaining 4 values (2 6 3 7) fill val[1].
  - vuzp: reverse interleaved access
    - Intrinsic: int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
    - Meaning: similar to de-interleaving channel data
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 2 4 6, val[1]: 1 3 5 7
      - Explanation: data is read horizontally and written vertically. The read pattern is a straight line, and the write pattern is W-shaped.
      - In detail: the values of a and b are read out in order, giving 0 1 2 3 4 5 6 7, and are then written into val[0]/val[1] column by column.
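The zip and uzp behaviour described above can be sketched as a scalar model in plain C. This is only an illustration of the lane ordering, not NEON code: `model_zip`, `model_uzp`, and `check_zip_uzp` are hypothetical names, and the model assumes four 32-bit lanes per register. The self-check also shows that applying uzp to the two outputs of zip recovers the original inputs.

```c
#include <assert.h>
#include <string.h>

#define LANES 4   /* four 32-bit lanes per q register */

/* Hypothetical scalar model of vzipq_s32: take one lane from a, then one
 * from b. out[0] receives the first four values, out[1] the rest. */
void model_zip(const int a[LANES], const int b[LANES], int out[2][LANES])
{
    int flat[2 * LANES];
    for (int i = 0; i < LANES; i++) {
        flat[2 * i]     = a[i];
        flat[2 * i + 1] = b[i];
    }
    memcpy(out[0], flat, sizeof(int) * LANES);
    memcpy(out[1], flat + LANES, sizeof(int) * LANES);
}

/* Hypothetical scalar model of vuzpq_s32: treat a followed by b as one
 * sequence; even positions go to out[0], odd positions to out[1]. */
void model_uzp(const int a[LANES], const int b[LANES], int out[2][LANES])
{
    int flat[2 * LANES];
    memcpy(flat, a, sizeof(int) * LANES);
    memcpy(flat + LANES, b, sizeof(int) * LANES);
    for (int i = 0; i < LANES; i++) {
        out[0][i] = flat[2 * i];
        out[1][i] = flat[2 * i + 1];
    }
}

void check_zip_uzp(void)
{
    const int a[LANES] = {0, 1, 2, 3};
    const int b[LANES] = {4, 5, 6, 7};
    const int zip0[LANES] = {0, 4, 1, 5}, zip1[LANES] = {2, 6, 3, 7};
    const int uzp0[LANES] = {0, 2, 4, 6}, uzp1[LANES] = {1, 3, 5, 7};
    int z[2][LANES], u[2][LANES], back[2][LANES];

    model_zip(a, b, z);
    assert(memcmp(z[0], zip0, sizeof zip0) == 0);
    assert(memcmp(z[1], zip1, sizeof zip1) == 0);

    model_uzp(a, b, u);
    assert(memcmp(u[0], uzp0, sizeof uzp0) == 0);
    assert(memcmp(u[1], uzp1, sizeof uzp1) == 0);

    /* uzp applied to the zip result gives back a and b */
    model_uzp(z[0], z[1], back);
    assert(memcmp(back[0], a, sizeof a) == 0);
    assert(memcmp(back[1], b, sizeof b) == 0);
}
```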
Test code
ld4q/st4q functional test
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Source matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // vld4q reads 16 floats with a stride of 4: val[k] holds column k of M
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);

    // ver1: store each register sequentially, so MT becomes the transpose of M
    vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
    vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]); // 1 5 9 13
    vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]); // 2 6 10 14
    vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.; // clear before the next test
        }
        printf("\n");
    }

    // ver2: the paired vst4q store restores the original layout of M
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
Output results
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
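The ver1 and ver2 results above follow directly from the index mapping of the load and the store. The sketch below models that mapping in plain C so it can be run without NEON hardware. The names `model_ld4`, `model_st4`, and `check_ld4_st4` are hypothetical, and the model assumes four registers of four 32-bit lanes each.

```c
#include <assert.h>
#include <string.h>

#define N_REGS 4   /* the "4" in ld4q/st4q */
#define LANES  4   /* four 32-bit lanes per q register */

/* Hypothetical scalar model of vld4q_f32: lane i of register k takes
 * mem[N_REGS * i + k], so register k ends up holding column k. */
void model_ld4(const float *mem, float regs[N_REGS][LANES])
{
    for (int k = 0; k < N_REGS; k++)
        for (int i = 0; i < LANES; i++)
            regs[k][i] = mem[N_REGS * i + k];
}

/* Hypothetical scalar model of vst4q_f32: the inverse index mapping. */
void model_st4(float *mem, float regs[N_REGS][LANES])
{
    for (int k = 0; k < N_REGS; k++)
        for (int i = 0; i < LANES; i++)
            mem[N_REGS * i + k] = regs[k][i];
}

void check_ld4_st4(void)
{
    float M[N_REGS * LANES], out[N_REGS * LANES];
    float regs[N_REGS][LANES];

    for (int i = 0; i < N_REGS * LANES; i++)
        M[i] = (float)i;

    /* ver1: the registers, read one after another, form the transpose */
    model_ld4(M, regs);
    assert(regs[0][0] == 0 && regs[0][1] == 4 && regs[0][2] == 8 && regs[0][3] == 12);
    assert(regs[3][0] == 3 && regs[3][1] == 7 && regs[3][2] == 11 && regs[3][3] == 15);

    /* ver2: the paired store restores the original layout */
    model_st4(out, regs);
    assert(memcmp(M, out, sizeof M) == 0);
}
```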
zip/uzp functional test
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Source matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // zip: read by column, write by row
    float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
    float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
    float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
    vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
    vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7

    // uzp: read by row, write by column
    float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
    float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
    float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
    vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
    vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
Summary
From the comparisons above, there is a simple way to picture interleaved and reverse interleaved access: imagine a matrix. Interleaved access reads by column and writes into the new variable by row; reverse interleaved access reads by row and writes by column. That is all there is to it.
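As a usage illustration of "read by column", the sketch below splits interleaved two-channel data into two separate arrays. It is a minimal example under assumptions: stereo audio is used only as a stand-in for the "different channels" case, and `split_stereo` and `check_split_stereo` are hypothetical names. When the compiler defines `__ARM_NEON`, the main loop uses the `vld2q_f32` and `vst1q_f32` intrinsics; otherwise only the scalar loop runs, and it produces the same result.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split `frames` interleaved stereo frames
 * (L0 R0 L1 R1 ...) into separate left and right arrays. */
void split_stereo(const float *interleaved, float *left, float *right,
                  size_t frames)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four frames per iteration: val[0] gets the L samples, val[1] the R samples. */
    for (; i + 4 <= frames; i += 4) {
        float32x4x2_t lr = vld2q_f32(interleaved + 2 * i);
        vst1q_f32(left + i, lr.val[0]);
        vst1q_f32(right + i, lr.val[1]);
    }
#endif
    /* Scalar tail, and the full loop when NEON is not available. */
    for (; i < frames; i++) {
        left[i]  = interleaved[2 * i];
        right[i] = interleaved[2 * i + 1];
    }
}

void check_split_stereo(void)
{
    float in[10], left[5], right[5];

    for (int i = 0; i < 10; i++)
        in[i] = (float)i;            /* L: 0 2 4 6 8, R: 1 3 5 7 9 */

    split_stereo(in, left, right, 5);
    for (int i = 0; i < 5; i++) {
        assert(left[i]  == (float)(2 * i));
        assert(right[i] == (float)(2 * i + 1));
    }
}
```

Five frames are used in the self-check so that both the four-frame vector path and the scalar tail are exercised on a NEON target.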