NEON Optimization: About cross access and reverse cross access
2022-07-07 01:14:00 【To know】
NEON optimization series:
- NEON Optimization 1: Software performance optimization and how to reduce power consumption
- NEON Optimization 2: Summary of frequently used ARM optimization instructions
- NEON Optimization 3: Matrix transpose instruction optimization case
- NEON Optimization 4: Optimization case for the floor/ceil functions
- NEON Optimization 5: Optimization case for the log10 function
- NEON Optimization 6: About cross access and reverse cross access
- NEON Optimization 7: Summary of performance optimization experience
- NEON Optimization 8: Performance optimization FAQ
Background
NEON optimization often involves moving data between memory and registers, and between registers. NEON memory read/write instructions are cross (interleaved) accesses by default, while some special instructions perform the reverse cross access.
What is cross access, and what is reverse cross access?
- Cross read/write: ld2q/3q/4q, st2q/3q/4q, zip
  - Explanation: consecutive elements are dealt out to the registers in turn, with a stride equal to the number in 2q/3q/4q.
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld2q reads it into 2 registers: val[0]: a1 a2, val[1]: b1 b2
- Reverse cross read/write: ld1q/st1q, uzp
  - Explanation: elements enter the register one after another, in memory order.
  - Example: if memory holds the consecutive data a1 b1 a2 b2, ld1q reads it into 1 register: val[0]: a1 b1 a2 b2
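The two load patterns above can be sketched with a plain scalar model. This is only an illustration of where each element ends up, not how the hardware is implemented; the names `cross_load` and `sequential_load` are hypothetical. The example in the text uses 2 lanes per register for brevity, whereas a real `vld2q_f32` fills two 4-lane registers from 8 consecutive floats (`n_regs = 2`, `lanes = 4` in this model).

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of an ldNq-style (cross) load.
 * Memory element i goes to register i % n_regs, lane i / n_regs.
 * regs is laid out register by register: regs[reg * lanes + lane]. */
void cross_load(const float *mem, size_t n_regs, size_t lanes, float *regs)
{
    assert(n_regs >= 1 && lanes >= 1);
    for (size_t i = 0; i < n_regs * lanes; i++) {
        size_t reg  = i % n_regs;  /* which register receives the element */
        size_t lane = i / n_regs;  /* position inside that register */
        regs[reg * lanes + lane] = mem[i];
    }
}

/* Scalar model of an ld1q-style load: memory order is preserved. */
void sequential_load(const float *mem, size_t count, float *reg)
{
    for (size_t i = 0; i < count; i++) {
        reg[i] = mem[i];
    }
}
```

Calling `cross_load(mem, 2, 2, regs)` on a1 b1 a2 b2 places a1 a2 in the first register and b1 b2 in the second, while `sequential_load` leaves the order untouched.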
Related functions
This section covers two kinds of interaction: between memory and registers, and between registers.
- Memory-to-register interaction
  - ld1q/st1q
    - With only one dimension, a cross read/write behaves the same as an ordinary sequential read/write.
  - ld2q/st2q, and likewise 3q and 4q
    - Purpose: all of these are cross reads/writes, used to handle data from different channels or dimensions.
    - Explanation: data is read vertically and written horizontally (read by column, write by row).
    - Note: when ld4q and st4q are used as a pair, the data is restored. This is equivalent to transposing the matrix into the registers and then transposing it back into memory.
- Register-to-register interaction
  - vzip: cross access
    - Intrinsic:
      int32x4x2_t vzipq_s32(int32x4_t a, int32x4_t b);
    - Meaning:
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 4 1 5, val[1]: 2 6 3 7
      - Explanation: data is read vertically and written horizontally; the read path is W-shaped (a zigzag) and the write path is a straight line.
      - In detail: read one element from a, then one from b, and so on, giving 0 4 1 5 2 6 3 7. The first 4 values (0 4 1 5) fill val[0] horizontally, and the remaining 4 values (2 6 3 7) are written to val[1].
  - vuzp: reverse cross access
    - Intrinsic:
      int32x4x2_t vuzpq_s32(int32x4_t a, int32x4_t b);
    - Meaning: similar to de-interleaving channel data
      - Input: a = {0 1 2 3}, b = {4 5 6 7}
      - Output: val[0]: 0 2 4 6, val[1]: 1 3 5 7
      - Explanation: data is read horizontally and written vertically; the read path is a straight line and the write path is W-shaped.
      - In detail: the values of a and b are read out in order, giving 0 1 2 3 4 5 6 7, and are then written into val[0]/val[1] column by column.
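The register-to-register behavior described above can be checked with a small scalar model that runs on any machine. This is a sketch under stated assumptions: `pair4`, `zip4`, `uzp4`, and `check_zip_uzp_roundtrip` are hypothetical names that only mimic the lane layout of `vzipq_s32` and `vuzpq_s32`, and each function follows the "merge, then split" description in the text rather than any real hardware data path.

```c
#include <assert.h>
#include <stdint.h>

/* Two 4-lane results, mirroring the layout of int32x4x2_t. */
typedef struct {
    int32_t val[2][4];
} pair4;

/* Scalar model of vzipq_s32: take a[0], b[0], a[1], b[1], ... and
 * write the eight values straight through val[0], then val[1]. */
pair4 zip4(const int32_t a[4], const int32_t b[4])
{
    int32_t merged[8];
    pair4 out;
    for (int i = 0; i < 4; i++) {
        merged[2 * i]     = a[i];
        merged[2 * i + 1] = b[i];
    }
    for (int i = 0; i < 4; i++) {
        out.val[0][i] = merged[i];
        out.val[1][i] = merged[4 + i];
    }
    return out;
}

/* Scalar model of vuzpq_s32: concatenate a and b, then send the
 * even-indexed values to val[0] and the odd-indexed ones to val[1]. */
pair4 uzp4(const int32_t a[4], const int32_t b[4])
{
    int32_t merged[8];
    pair4 out;
    for (int i = 0; i < 4; i++) {
        merged[i]     = a[i];
        merged[4 + i] = b[i];
    }
    for (int i = 0; i < 4; i++) {
        out.val[0][i] = merged[2 * i];
        out.val[1][i] = merged[2 * i + 1];
    }
    return out;
}

/* Self-check: unzipping a zipped pair gives back the original inputs. */
void check_zip_uzp_roundtrip(void)
{
    const int32_t a[4] = {0, 1, 2, 3};
    const int32_t b[4] = {4, 5, 6, 7};
    pair4 z = zip4(a, b);
    pair4 u = uzp4(z.val[0], z.val[1]);
    for (int i = 0; i < 4; i++) {
        assert(u.val[0][i] == a[i]);
        assert(u.val[1][i] == b[i]);
    }
}
```

The round-trip check shows that, in this model, zip and uzp undo each other, which matches the "cross" versus "reverse cross" naming used in the text.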
Test code
ld4q/st4q functional test
#include <arm_neon.h>
#include <stdio.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial matrix
    float M[ROW_NUM][COL_NUM] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // Cross load: val[k] receives every 4th element, starting at element k
    float32x4x4_t vf32x4x4fTmpABCD = vld4q_f32(&M[0][0]);

    // ver1: store each register sequentially, so MT becomes the transpose of M
    vst1q_f32(&MT[0][0], vf32x4x4fTmpABCD.val[0]); // 0 4 8 12
    vst1q_f32(&MT[1][0], vf32x4x4fTmpABCD.val[1]); // 1 5 9 13
    vst1q_f32(&MT[2][0], vf32x4x4fTmpABCD.val[2]); // 2 6 10 14
    vst1q_f32(&MT[3][0], vf32x4x4fTmpABCD.val[3]); // 3 7 11 15
    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.; // clear for the next test
        }
        printf("\n");
    }

    // ver2: cross store with vst4q, which restores the original layout of M
    vst4q_f32(&MT[0][0], vf32x4x4fTmpABCD);
    printf("ver2:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
Output results
ver1:
0.000000 4.000000 8.000000 12.000000
1.000000 5.000000 9.000000 13.000000
2.000000 6.000000 10.000000 14.000000
3.000000 7.000000 11.000000 15.000000
ver2:
0.000000 1.000000 2.000000 3.000000
4.000000 5.000000 6.000000 7.000000
8.000000 9.000000 10.000000 11.000000
12.000000 13.000000 14.000000 15.000000
zip/uzp functional test
#include <arm_neon.h>
#include <stdio.h>

#define ROW_NUM 4
#define COL_NUM 4

int main(void)
{
    // Initial matrix
    float M[ROW_NUM][COL_NUM] = {
        { 0,  1,  2,  3},
        { 4,  5,  6,  7},
        { 8,  9, 10, 11},
        {12, 13, 14, 15},
    };
    float MT[ROW_NUM][COL_NUM];
    int i, j;

    // Cross access (zip): read by column, write by row
    float32x4_t vf32x4fTmp1 = vld1q_f32(&M[0][0]); // 0 1 2 3
    float32x4_t vf32x4fTmp2 = vld1q_f32(&M[1][0]); // 4 5 6 7
    float32x4x2_t vf32x4x2fTmpZip = vzipq_f32(vf32x4fTmp1, vf32x4fTmp2);
    vst1q_f32(&MT[0][0], vf32x4x2fTmpZip.val[0]); // 0 4 1 5
    vst1q_f32(&MT[1][0], vf32x4x2fTmpZip.val[1]); // 2 6 3 7

    // Reverse cross access (uzp): read by row, write by column
    float32x4_t vf32x4fTmp3 = vld1q_f32(&M[2][0]); // 8 9 10 11
    float32x4_t vf32x4fTmp4 = vld1q_f32(&M[3][0]); // 12 13 14 15
    float32x4x2_t vf32x4x2fTmpUzp = vuzpq_f32(vf32x4fTmp3, vf32x4fTmp4);
    vst1q_f32(&MT[2][0], vf32x4x2fTmpUzp.val[0]); // 8 10 12 14
    vst1q_f32(&MT[3][0], vf32x4x2fTmpUzp.val[1]); // 9 11 13 15

    printf("ver1:\n");
    for (i = 0; i < ROW_NUM; i++) {
        for (j = 0; j < COL_NUM; j++) {
            printf("%f ", MT[i][j]);
            MT[i][j] = 0.;
        }
        printf("\n");
    }
    return 0;
}
Summary
From the comparisons above, cross access and reverse cross access can be pictured simply with a matrix: cross access reads by column and writes to the new variable by row, while reverse cross access reads by row and writes by column. That is all there is to it.
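The "read by column, write by row" picture can be packaged as a small helper. The sketch below is an assumption-laden illustration, not code from the original article: `transpose4x4` is a hypothetical name, the NEON path is compiled only when the compiler defines `__ARM_NEON` (or the older `__ARM_NEON__`), and a plain scalar loop stands in on other targets so the same file builds anywhere.

```c
#include <assert.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Transpose a 4x4 row-major float matrix: dst[c*4 + r] = src[r*4 + c].
 * src and dst each point to 16 floats and must not be the same buffer. */
void transpose4x4(const float *src, float *dst)
{
    assert(src != dst);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Cross load: cols.val[c] holds column c of src. */
    float32x4x4_t cols = vld4q_f32(src);
    /* Sequential stores: each column becomes one row of dst. */
    for (int c = 0; c < 4; c++) {
        vst1q_f32(dst + 4 * c, cols.val[c]);
    }
#else
    /* Scalar fallback with the same result. */
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            dst[c * 4 + r] = src[r * 4 + c];
        }
    }
#endif
}
```

Passing `&M[0][0]` and `&MT[0][0]` from the test code above would fill `MT` with the transpose of `M` on either path.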