Converting SSE instructions to ARM NEON
2022-08-02 15:26:00 【Hongyao】
Related Information
● SSE instruction set: SSE instruction reference
● sse2neon repository: the NEON conversion for each SSE intrinsic can be found in sse2neon.h
Notes
● Converting SSE instructions to ARM NEON instructions one for one is often hard to optimize and can even make the code slower, so treat the optimizations in this part as reference only.
_mm_shuffle_ps conversion
_mm_shuffle_ps(m1, m2, _MM_SHUFFLE(i3,i2,i1,i0)) builds its result m3 in two halves. It takes two elements from m1 and puts them in the low half of m3, selected by the last two numbers of _MM_SHUFFLE (i1 and i0). It then takes two elements from m2 and puts them in the high half of m3, selected by the first two numbers of _MM_SHUFFLE (i3 and i2).
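The selection rule above can be sketched as a scalar model in plain C. This is only an illustration of the semantics, not how SSE or sse2neon implements it; the names vec4f, SHUFFLE_IMM, and shuffle_ps_model are hypothetical stand-ins for __m128, _MM_SHUFFLE, and _mm_shuffle_ps.

```c
#include <assert.h>

/* Hypothetical 4-lane float vector standing in for __m128. */
typedef struct { float lane[4]; } vec4f;

/* Same bit layout as _MM_SHUFFLE: two bits per index, i3 in the top bits. */
#define SHUFFLE_IMM(i3, i2, i1, i0) \
    (((i3) << 6) | ((i2) << 4) | ((i1) << 2) | (i0))

/* Scalar model of _mm_shuffle_ps(m1, m2, imm). */
static vec4f shuffle_ps_model(vec4f m1, vec4f m2, unsigned imm)
{
    vec4f m3;
    assert(imm < 256u);                      /* the immediate is 8 bits wide */
    m3.lane[0] = m1.lane[imm & 3u];          /* i0 picks from m1 */
    m3.lane[1] = m1.lane[(imm >> 2) & 3u];   /* i1 picks from m1 */
    m3.lane[2] = m2.lane[(imm >> 4) & 3u];   /* i2 picks from m2 */
    m3.lane[3] = m2.lane[(imm >> 6) & 3u];   /* i3 picks from m2 */
    return m3;
}
```

With a = {a0, a1, a2, a3} and b = {b0, b1, b2, b3}, shuffle_ps_model(a, b, SHUFFLE_IMM(2,2,0,0)) gives {a0, a0, b2, b2}.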
For _mm_shuffle_ps, sse2neon mostly uses combinations of lane extraction, duplication, and recombination together with type reinterpretation. The following code, for example, corresponds to _mm_shuffle_ps(a, b, _MM_SHUFFLE(2,2,0,0)):
FORCE_INLINE __m128 _mm_shuffle_ps_2200(__m128 a, __m128 b)
{
    float32x2_t a00 = vdup_lane_f32(vget_low_f32(vreinterpretq_f32_m128(a)), 0);
    float32x2_t b22 = vdup_lane_f32(vget_high_f32(vreinterpretq_f32_m128(b)), 0);
    return vreinterpretq_m128_f32(vcombine_f32(a00, b22));
}
Using a conversion like the above directly will often reduce performance rather than improve it. A better approach is to look for a NEON operation that does a similar job natively. These are mostly the permutation instructions, such as vtrn, vrev, vzip, and vuzp.
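As a quick reference for what these permutations do on four float lanes, here is a scalar sketch in plain C. It models the lane ordering of vtrnq_f32, vzipq_f32, vuzpq_f32, and vrev64q_f32 only; pair4f and the *_model function names are hypothetical, with pair4f standing in for float32x4x2_t.

```c
#include <assert.h>

/* Hypothetical stand-in for float32x4x2_t: two 4-lane results. */
typedef struct { float val[2][4]; } pair4f;

/* Model of vtrnq_f32(a, b): transpose each 2x2 block of lanes. */
static pair4f trn_model(const float *a, const float *b)
{
    pair4f r = {{{a[0], b[0], a[2], b[2]}, {a[1], b[1], a[3], b[3]}}};
    return r;
}

/* Model of vzipq_f32(a, b): interleave the lanes of a and b. */
static pair4f zip_model(const float *a, const float *b)
{
    pair4f r = {{{a[0], b[0], a[1], b[1]}, {a[2], b[2], a[3], b[3]}}};
    return r;
}

/* Model of vuzpq_f32(a, b): split into even lanes and odd lanes. */
static pair4f uzp_model(const float *a, const float *b)
{
    pair4f r = {{{a[0], a[2], b[0], b[2]}, {a[1], a[3], b[1], b[3]}}};
    return r;
}

/* Model of vrev64q_f32(a): swap the two lanes inside each 64-bit half. */
static void rev64_model(const float *a, float *out)
{
    float t[4];
    int i;
    assert(a != 0 && out != 0);
    t[0] = a[1]; t[1] = a[0]; t[2] = a[3]; t[3] = a[2];
    for (i = 0; i < 4; ++i) out[i] = t[i];   /* copy via t so out may alias a */
}
```

Comparing a needed _MM_SHUFFLE pattern against these lane orders is one way to spot cases where a single NEON permutation can replace a generic shuffle.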
For example, continuing the case above: if you need both _mm_shuffle_ps(a, a, _MM_SHUFFLE(2,2,0,0)) and _mm_shuffle_ps(a, a, _MM_SHUFFLE(3,3,1,1)) at the same time, a single vtrnq_f32(a, a) produces both. The result has type float32x4x2_t: val[0] corresponds to 2200 and val[1] corresponds to 3311.
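A minimal sketch of this trick follows. The helper name shuffle_2200_3311 and its float-array interface are assumptions made for illustration; the NEON path uses vld1q_f32, vtrnq_f32, and vst1q_f32, and a scalar fallback is included only so the sketch also builds on non-ARM hosts.

```c
#include <assert.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: computes both shuffles of a in one step.
 * out2200 receives {a0, a0, a2, a2}; out3311 receives {a1, a1, a3, a3}. */
static void shuffle_2200_3311(const float *a, float *out2200, float *out3311)
{
    assert(a != 0 && out2200 != 0 && out3311 != 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    {
        float32x4_t v = vld1q_f32(a);
        float32x4x2_t t = vtrnq_f32(v, v);   /* one transpose yields both */
        vst1q_f32(out2200, t.val[0]);
        vst1q_f32(out3311, t.val[1]);
    }
#else
    /* Scalar fallback with the same lane order as vtrnq_f32(a, a). */
    {
        float t0 = a[0], t1 = a[1], t2 = a[2], t3 = a[3];
        out2200[0] = t0; out2200[1] = t0; out2200[2] = t2; out2200[3] = t2;
        out3311[0] = t1; out3311[1] = t1; out3311[2] = t3; out3311[3] = t3;
    }
#endif
}
```

In real code the values would normally stay in float32x4_t registers rather than round-tripping through memory; the loads and stores here only keep the example self-contained.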