Converting SSE instructions to ARM NEON
2022-08-02 15:26:00 【Hongyao】
Related Information
● SSE instruction set: explanation of the SSE instructions
● sse2neon repository: the corresponding NEON conversion for each SSE intrinsic can be found in sse2neon.h
Notes
● Code converted from SSE to ARM NEON is often hard to optimize and may even run slower than before, so the optimizations described here are for reference only.
_mm_shuffle_ps conversion
_mm_shuffle_ps(m1, m2, _MM_SHUFFLE(i3,i2,i1,i0)) builds its result m3 from four selected elements. The two low elements of m3 are taken from m1, selected by the last two numbers of _MM_SHUFFLE (i1 and i0). The two high elements of m3 are taken from m2, selected by the first two numbers (i3 and i2).
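The selection rule above can be sketched as a scalar model in plain C. This is only an illustration of the semantics, not how SSE or sse2neon implements it. The names SHUFFLE_IMM, shuffle_ps_ref, and shuffle_ps_ref_example are hypothetical; SHUFFLE_IMM uses the same bit layout as _MM_SHUFFLE so that the sketch compiles without an SSE header.

```c
#include <assert.h>

/* Hypothetical stand-in for _MM_SHUFFLE, using the same bit layout:
   i0 in bits 0-1, i1 in bits 2-3, i2 in bits 4-5, i3 in bits 6-7. */
#define SHUFFLE_IMM(i3, i2, i1, i0) \
    (((i3) << 6) | ((i2) << 4) | ((i1) << 2) | (i0))

/* Scalar model of m3 = _mm_shuffle_ps(m1, m2, imm). */
void shuffle_ps_ref(const float m1[4], const float m2[4], int imm, float m3[4])
{
    m3[0] = m1[imm & 3];         /* low lanes come from m1, chosen by i0 and i1 */
    m3[1] = m1[(imm >> 2) & 3];
    m3[2] = m2[(imm >> 4) & 3];  /* high lanes come from m2, chosen by i2 and i3 */
    m3[3] = m2[(imm >> 6) & 3];
}

/* Usage: the 2,2,0,0 pattern picks a[0], a[0], b[2], b[2]. */
void shuffle_ps_ref_example(void)
{
    const float a[4] = {10.f, 11.f, 12.f, 13.f};
    const float b[4] = {20.f, 21.f, 22.f, 23.f};
    float r[4];

    shuffle_ps_ref(a, b, SHUFFLE_IMM(2, 2, 0, 0), r);
    assert(r[0] == 10.f && r[1] == 10.f && r[2] == 22.f && r[3] == 22.f);
}
```

A model like this can serve as a reference when checking that a NEON replacement produces the same lanes as the original SSE shuffle.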
To convert _mm_shuffle_ps, sse2neon mostly combines lane access intrinsics (extract, duplicate, combine) with type reinterpretation, as in the following code, which corresponds to _mm_shuffle_ps(a, b, _MM_SHUFFLE(2,2,0,0)).
FORCE_INLINE __m128 _mm_shuffle_ps_2200(__m128 a, __m128 b)
{
    float32x2_t a00 = vdup_lane_f32(vget_low_f32(vreinterpretq_f32_m128(a)), 0);
    float32x2_t b22 = vdup_lane_f32(vget_high_f32(vreinterpretq_f32_m128(b)), 0);
    return vreinterpretq_m128_f32(vcombine_f32(a00, b22));
}
Using a direct conversion like the one above is likely to make performance worse rather than better. A better approach is to look for an equivalent operation in NEON. The relevant operations are mostly the permutation intrinsics, such as vtrn, vrev, vzip, and vuzp.
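To make the lane patterns of these permutation intrinsics concrete, here is a minimal sketch that models vzipq_f32, vuzpq_f32, and vrev64q_f32 in scalar C. The function names zipq_ref, uzpq_ref, rev64q_ref, and permute_ref_example are hypothetical, and the code only models which lane goes where; it does not call NEON. The comments note which _mm_shuffle_ps immediates give the same lanes.

```c
#include <assert.h>

/* Scalar model of vzipq_f32(a, b): interleave the lanes of a and b.
   With b == a, v0 matches _MM_SHUFFLE(1,1,0,0) and v1 matches _MM_SHUFFLE(3,3,2,2). */
void zipq_ref(const float a[4], const float b[4], float v0[4], float v1[4])
{
    v0[0] = a[0]; v0[1] = b[0]; v0[2] = a[1]; v0[3] = b[1];
    v1[0] = a[2]; v1[1] = b[2]; v1[2] = a[3]; v1[3] = b[3];
}

/* Scalar model of vuzpq_f32(a, b): split into even and odd lanes.
   v0 matches _mm_shuffle_ps(a, b, _MM_SHUFFLE(2,0,2,0)),
   v1 matches _mm_shuffle_ps(a, b, _MM_SHUFFLE(3,1,3,1)). */
void uzpq_ref(const float a[4], const float b[4], float v0[4], float v1[4])
{
    v0[0] = a[0]; v0[1] = a[2]; v0[2] = b[0]; v0[3] = b[2];
    v1[0] = a[1]; v1[1] = a[3]; v1[2] = b[1]; v1[3] = b[3];
}

/* Scalar model of vrev64q_f32(a): swap the two floats in each 64-bit half.
   Matches _mm_shuffle_ps(a, a, _MM_SHUFFLE(2,3,0,1)). */
void rev64q_ref(const float a[4], float r[4])
{
    r[0] = a[1]; r[1] = a[0]; r[2] = a[3]; r[3] = a[2];
}

/* Usage: the even lanes of a followed by the even lanes of b. */
void permute_ref_example(void)
{
    const float a[4] = {0.f, 1.f, 2.f, 3.f};
    const float b[4] = {4.f, 5.f, 6.f, 7.f};
    float v0[4], v1[4];

    uzpq_ref(a, b, v0, v1);
    assert(v0[0] == 0.f && v0[1] == 2.f && v0[2] == 4.f && v0[3] == 6.f);
}
```

When a shuffle pattern in the SSE code lines up with one of these lane layouts, a single permutation intrinsic can stand in for the generic lane-by-lane conversion.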
For example, if you need both _mm_shuffle_ps(a, a, _MM_SHUFFLE(2,2,0,0)) and _mm_shuffle_ps(a, a, _MM_SHUFFLE(3,3,1,1)) at the same time, a single vtrnq_f32(a, a) produces both. The result has type float32x4x2_t: val[0] corresponds to 2200 and val[1] corresponds to 3311.
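The vtrnq_f32 replacement can be sketched as follows. The helper names shuffle_2200_3311 and shuffle_2200_3311_example are hypothetical, and the array-based interface is an assumption made so the sketch stays self-contained; real code would normally keep the values in float32x4_t registers. The scalar branch exists only so the sketch also builds on non-ARM machines.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: produces the 2200 and 3311 shuffles of a together. */
void shuffle_2200_3311(const float a[4], float r2200[4], float r3311[4])
{
#if defined(__ARM_NEON)
    float32x4_t va = vld1q_f32(a);
    float32x4x2_t t = vtrnq_f32(va, va);  /* one transpose yields both results */
    vst1q_f32(r2200, t.val[0]);           /* a0 a0 a2 a2 */
    vst1q_f32(r3311, t.val[1]);           /* a1 a1 a3 a3 */
#else
    /* Scalar fallback with the same lane layout, for non-NEON builds. */
    r2200[0] = a[0]; r2200[1] = a[0]; r2200[2] = a[2]; r2200[3] = a[2];
    r3311[0] = a[1]; r3311[1] = a[1]; r3311[2] = a[3]; r3311[3] = a[3];
#endif
}

/* Usage example. */
void shuffle_2200_3311_example(void)
{
    const float a[4] = {1.f, 2.f, 3.f, 4.f};
    float lo[4], hi[4];

    shuffle_2200_3311(a, lo, hi);
    assert(lo[0] == 1.f && lo[1] == 1.f && lo[2] == 3.f && lo[3] == 3.f);
    assert(hi[0] == 2.f && hi[1] == 2.f && hi[2] == 4.f && hi[3] == 4.f);
}
```

Note that this equivalence relies on both shuffle operands being the same vector: with two different inputs, vtrnq_f32(a, b) interleaves lanes of a and b, which is not what _mm_shuffle_ps(a, b, _MM_SHUFFLE(2,2,0,0)) returns.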