
Converting SSE instructions to ARM NEON

2022-08-02 15:26:00 Hongyao

Related Information

SSE instruction set: an explanation of the SSE instructions
sse2neon repository: the NEON equivalent of each SSE intrinsic can be found in sse2neon.h

Notes

● Converting SSE instructions to ARM NEON instructions one by one is often hard to optimize and can even make the code slower, so the optimizations described here are for reference only.

_mm_shuffle_ps conversion

_mm_shuffle_ps(m1, m2, _MM_SHUFFLE(i3,i2,i1,i0)) builds its result m3 from two elements of each input. The low two elements of m3 are taken from m1, selected by the last two numbers of _MM_SHUFFLE (i1 and i0). The high two elements of m3 are taken from m2, selected by the first two numbers (i3 and i2). In other words, m3[0] = m1[i0], m3[1] = m1[i1], m3[2] = m2[i2], and m3[3] = m2[i3].
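The selection rule can be sketched in plain C without any SIMD headers. This is only a scalar reference model, not how SSE or sse2neon implements the instruction: the type ref_m128, the macro REF_MM_SHUFFLE, and the function ref_shuffle_ps are hypothetical names introduced here for illustration. The macro uses the same bit layout as _MM_SHUFFLE.

```c
#include <assert.h>

/* Hypothetical scalar stand-in for __m128: four floats, lane 0 first. */
typedef struct { float lane[4]; } ref_m128;

/* Same bit layout as _MM_SHUFFLE: i3 in bits 7..6, i0 in bits 1..0. */
#define REF_MM_SHUFFLE(i3, i2, i1, i0) \
    (((i3) << 6) | ((i2) << 4) | ((i1) << 2) | (i0))

/* Scalar model of _mm_shuffle_ps(m1, m2, imm8). */
ref_m128 ref_shuffle_ps(ref_m128 m1, ref_m128 m2, unsigned imm8)
{
    ref_m128 m3;
    assert(imm8 <= 0xFF);                   /* the selector is 8 bits wide */
    m3.lane[0] = m1.lane[imm8 & 3];         /* i0 selects from m1 */
    m3.lane[1] = m1.lane[(imm8 >> 2) & 3];  /* i1 selects from m1 */
    m3.lane[2] = m2.lane[(imm8 >> 4) & 3];  /* i2 selects from m2 */
    m3.lane[3] = m2.lane[(imm8 >> 6) & 3];  /* i3 selects from m2 */
    return m3;
}
```

With m1 = {10, 11, 12, 13} and m2 = {20, 21, 22, 23}, calling ref_shuffle_ps(m1, m2, REF_MM_SHUFFLE(2,2,0,0)) gives {10, 10, 22, 22}: lane 0 of m1 twice, then lane 2 of m2 twice.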


For _mm_shuffle_ps, sse2neon mostly implements each shuffle pattern by combining lane extraction and recombination intrinsics with type reinterpretation. The following code, for example, corresponds to _mm_shuffle_ps(a, b, _MM_SHUFFLE(2,2,0,0)).

FORCE_INLINE __m128 _mm_shuffle_ps_2200(__m128 a, __m128 b)
{
    float32x2_t a00 = vdup_lane_f32(vget_low_f32(vreinterpretq_f32_m128(a)), 0);
    float32x2_t b22 = vdup_lane_f32(vget_high_f32(vreinterpretq_f32_m128(b)), 0);
    return vreinterpretq_m128_f32(vcombine_f32(a00, b22));
}

Using a conversion like the one above directly is likely to make performance worse rather than better. A better approach is to look for NEON operations that do the same job natively. These are mainly the permutation instructions, such as vtrn, vrev, vzip, and vuzp.
For example, if you need both _mm_shuffle_ps(a, a, _MM_SHUFFLE(2,2,0,0)) and _mm_shuffle_ps(a, a, _MM_SHUFFLE(3,3,1,1)), a single vtrnq_f32(a, a) produces both. The result has type float32x4x2_t, where val[0] corresponds to 2200 and val[1] corresponds to 3311.
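A minimal sketch of that replacement follows. The function name shuffle_2200_3311 and the HAVE_NEON macro are hypothetical, introduced only for this example. On ARM targets with NEON the function uses vtrnq_f32; on other targets it falls back to a scalar model of what vtrnq_f32(v, v) returns, which is an assumption made so that the sketch compiles anywhere.

```c
#include <assert.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: computes both shuffles of `a` with itself.
 *   out_2200 = {a0, a0, a2, a2}   (the _MM_SHUFFLE(2,2,0,0) pattern)
 *   out_3311 = {a1, a1, a3, a3}   (the _MM_SHUFFLE(3,3,1,1) pattern) */
void shuffle_2200_3311(const float a[4], float out_2200[4], float out_3311[4])
{
    assert(a && out_2200 && out_3311);
#if HAVE_NEON
    float32x4_t v = vld1q_f32(a);
    /* One transpose yields both patterns: val[0] holds the even lanes
       duplicated, val[1] holds the odd lanes duplicated. */
    float32x4x2_t t = vtrnq_f32(v, v);
    vst1q_f32(out_2200, t.val[0]);
    vst1q_f32(out_3311, t.val[1]);
#else
    /* Scalar model of vtrnq_f32(v, v) for targets without NEON. */
    out_2200[0] = a[0]; out_2200[1] = a[0];
    out_2200[2] = a[2]; out_2200[3] = a[2];
    out_3311[0] = a[1]; out_3311[1] = a[1];
    out_3311[2] = a[3]; out_3311[3] = a[3];
#endif
}
```

In real code the vector would normally stay in a register rather than go through memory as it does here. The loads and stores only make the lane contents easy to inspect.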

Copyright notice
This article was written by [Hongyao]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/214/202208021403332076.html