NEON Optimization: Interleaved and De-interleaved Access
2022-07-07 01:14:00 【To know】
NEON optimization series:
- NEON Optimization 1: Software performance optimization and how to reduce power consumption
- NEON Optimization 2: Summary of frequently used ARM optimization instructions
- NEON Optimization 3: Matrix transpose optimization case
- NEON Optimization 4: Optimization case for the floor/ceil functions
- NEON Optimization 5: Optimization case for the log10 function
- NEON Optimization 6: Interleaved and de-interleaved access
- NEON Optimization 7: Summary of performance optimization experience
- NEON Optimization 8: Performance optimization FAQ
Background
NEON optimization often involves moving data between memory and registers, and between registers. The multi-register NEON load/store intrinsics (vld2q/vld3q/vld4q and the matching stores) access memory in an interleaved pattern, while other instructions perform the reverse, de-interleaved pattern.
What is interleaved access, and what is de-interleaved access?
- Interleaved read/write: vld2q/vld3q/vld4q, vst2q/vst3q/vst4q, vzip
  - Explanation: consecutive elements are distributed at a fixed stride (the 2, 3, or 4 in the intrinsic name) into the corresponding registers.
  - Example: if memory holds a1 b1 a2 b2 contiguously, vld2q reads it into 2 registers: val[0]: a1 a2, val[1]: b1 b2
- De-interleaved read/write: vld1q/vst1q, vuzp
  - Explanation: elements are placed into the register one after another, in memory order.
  - Example: if memory holds a1 b1 a2 b2 contiguously, vld1q reads it into 1 register: val[0]: a1 b1 a2 b2
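The a1 b1 a2 b2 example above maps directly onto two-channel data such as stereo audio. The following is a minimal sketch under stated assumptions, not code from the original article: `split_stereo` is a hypothetical helper name, the input is assumed to be laid out as L0 R0 L1 R1 ..., and a scalar loop handles the tail and non-NEON builds. Note that Arm's own documentation describes vld2q as a load with de-interleaving, since it separates the channels on the way into the registers.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original article): split interleaved
 * stereo samples L0 R0 L1 R1 ... into separate left and right arrays. */
void split_stereo(const float *in, float *left, float *right, size_t frames)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* vld2q_f32 reads 8 consecutive floats and distributes them:
     * val[0] = L0 L1 L2 L3, val[1] = R0 R1 R2 R3. */
    for (; i + 4 <= frames; i += 4) {
        float32x4x2_t lr = vld2q_f32(in + 2 * i);
        vst1q_f32(left + i, lr.val[0]);
        vst1q_f32(right + i, lr.val[1]);
    }
#endif
    /* Scalar path: the tail on NEON builds, everything on other targets. */
    for (; i < frames; i++) {
        left[i] = in[2 * i];
        right[i] = in[2 * i + 1];
    }
}
```

Calling `split_stereo(in, left, right, frames)` with `in` holding `2 * frames` floats fills `left` and `right` with `frames` floats each; the NEON and scalar paths are intended to produce identical results.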
Related functions
This section covers two kinds of data movement: between memory and registers, and between registers.
- Memory-to-register interaction
  - vld1q/vst1q
    - With only one register there is nothing to interleave, so these behave the same as ordinary sequential loads and stores.
  - vld2q/vst2q, and likewise the 3q and 4q variants
    - Purpose: all of these are interleaved loads and stores, intended for data that carries several channels or dimensions.
    - Explanation: data is read vertically and written horizontally (read by column, write by row).
    - Note: when vld4q/vst4q are used as a pair, the data is restored. This is equivalent to transposing the matrix on the way into the registers and transposing it back on the way out to memory.
- Register-to-register interaction
  - vzip: interleaved access
    - Intrinsic: int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
    - Input: a = {0 1 2 3}, b = {4 5 6 7}
    - Output: val[0]: 0 4 1 5, val[1]: 2 6 3 7
    - Explanation: data is read vertically and written horizontally; the read follows a W-shaped (zigzag) path and the write follows a straight line.
    - In detail: one element is read from a, then one from b, and so on, giving 0 4 1 5 2 6 3 7. The first 4 values (0 4 1 5) fill val[0] horizontally, then the remaining 4 values (2 6 3 7) are written to val[1].
  - vuzp: de-interleaved access
    - Intrinsic: int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b); similar to de-interleaving channel data
    - Input: a = {0 1 2 3}, b = {4 5 6 7}
    - Output: val[0]: 0 2 4 6, val[1]: 1 3 5 7
    - Explanation: data is read horizontally and written vertically; the read follows a straight line and the write follows a W-shaped (zigzag) path.
    - In detail: the values of a and b are read out in order, giving 0 1 2 3 4 5 6 7, and are then written column by column across val[0] and val[1].
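The lane movement of vzipq_s32 and vuzpq_s32 described above can be written out as plain index arithmetic. This is only a scalar reference model for reasoning about the results, not how the instructions are implemented; `pair4_t`, `zip_model`, and `uzp_model` are illustrative names and are not part of arm_neon.h.

```c
#include <stdint.h>

/* Illustrative stand-in for int32x4x2_t; not the real NEON type. */
typedef struct { int32_t val[2][4]; } pair4_t;

/* Models vzipq_s32: interleave the lanes of a and b. */
pair4_t zip_model(const int32_t a[4], const int32_t b[4])
{
    pair4_t r;
    for (int i = 0; i < 4; i++) {
        /* In the 8-element result, slot 2*i comes from a and 2*i+1 from b.
         * Slots 0-3 form val[0], slots 4-7 form val[1]. */
        r.val[(2 * i) / 4][(2 * i) % 4] = a[i];
        r.val[(2 * i + 1) / 4][(2 * i + 1) % 4] = b[i];
    }
    return r;
}

/* Models vuzpq_s32: treat a followed by b as 8 elements; the even-indexed
 * ones go to val[0] and the odd-indexed ones go to val[1]. */
pair4_t uzp_model(const int32_t a[4], const int32_t b[4])
{
    pair4_t r;
    for (int i = 0; i < 4; i++) {
        const int32_t *src = (i < 2) ? a : b; /* lanes 0-1 from a, 2-3 from b */
        r.val[0][i] = src[(2 * i) % 4];
        r.val[1][i] = src[(2 * i) % 4 + 1];
    }
    return r;
}
```

With a = {0 1 2 3} and b = {4 5 6 7}, the model reproduces the outputs listed above. Feeding the two halves of a zip result into `uzp_model` returns the original a and b, which is one way to see that the two operations undo each other.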
Test code
Functional test of vld4q/vst4q
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // Interleaved load: val[k] receives every 4th element, starting at index k
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);

    // ver1: store each register sequentially, which yields the transpose of M
    vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
    vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]); // 1 5 9 13
    vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]); // 2 6 10 14
    vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.; // clear for the next test
        }
        printf("\n");
    }

    // ver2: interleaved store, which restores the original layout of M
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
Output results
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
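The ver1 and ver2 results above follow from a simple index mapping: lane l of val[k] holds memory element 4*l + k. The sketch below is a scalar model of that mapping, written only to make the transpose and the round trip explicit; `ld4_model` and `st4_model` are illustrative names rather than NEON intrinsics, and plain arrays stand in for float32x4x4_t.

```c
#include <stddef.h>

/* Models vld4q_f32: lane l of val[k] receives mem[4*l + k],
 * so val[k] collects column k of a row-major 4x4 matrix. */
void ld4_model(const float mem[16], float val[4][4])
{
    for (size_t k = 0; k < 4; k++)
        for (size_t l = 0; l < 4; l++)
            val[k][l] = mem[4 * l + k];
}

/* Models vst4q_f32: the inverse mapping, so a load followed by a store
 * reproduces the original memory contents. */
void st4_model(float mem[16], float val[4][4])
{
    for (size_t k = 0; k < 4; k++)
        for (size_t l = 0; l < 4; l++)
            mem[4 * l + k] = val[k][l];
}
```

Storing the rows of `val` one after another corresponds to ver1 (the transposed matrix), while `st4_model` corresponds to ver2 (the original matrix).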
Functional test of vzip/vuzp
#include <stdio.h>
#include <arm_neon.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial 4x4 matrix, stored row by row
    float M[ROW_NUM][COL_NUM] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // vzip: read by column, write by row
    float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
    float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
    float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
    vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
    vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7

    // vuzp: read by row, write by column
    float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
    float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
    float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
    vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
    vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
Summary
Comparing the cases above, interleaved and de-interleaved access can be pictured simply with a matrix: interleaved access reads by column and writes into the new variable by row, whereas de-interleaved access reads by row and writes by column. That is all there is to it.