8
votes

With SSE you can load a single float from memory into all 4 slots of a __m128 with the intrinsic _mm_load1_ps().

When using 256-bit-wide SIMD with AVX, there seems to be no _mm256_load1_ps() to load a single float from memory into all 8 slots of the vector.

Why was this omitted, and what's the best way around it?

Or even better: is there a way to load a single float into a targeted slot 0..7 of the vector?

1
AVX and AVX2 still only allow you to insert elements into the low 128 bits (PINSRD / INSERTPS: element number = compile-time constant). Doing this without zeroing the upper 128 bits is only possible with the non-VEX encoding, which triggers a massive slowdown on Intel pre-Skylake from mixing VEX and non-VEX instructions. You could extractf128, insertps, insertf128. - Peter Cordes
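The extractf128 / insertps / insertf128 sequence from the comment above might look roughly like the sketch below. It is an illustration, not a tuned implementation: the helper name `insert_slot5` is hypothetical, slot 5 is an arbitrary choice (element 1 of the upper 128-bit half), and the `target("avx")` attribute assumes GCC or Clang so the file builds without `-mavx`. The vector is passed through memory only to keep the example self-contained.

```c
#include <immintrin.h>

/* Hypothetical helper: overwrite element 5 of an 8-float vector with *src.
 * The slot must be a compile-time constant because the INSERTPS and
 * VEXTRACTF128/VINSERTF128 selectors are immediates. */
__attribute__((target("avx")))
void insert_slot5(float vec[8], const float *src)
{
    __m256 v  = _mm256_loadu_ps(vec);
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* elements 4..7 */
    __m128 s  = _mm_load_ss(src);              /* *src in element 0 */

    /* imm8 = 0x10: source element 0 -> destination element 1, zero nothing */
    hi = _mm_insert_ps(hi, s, 0x10);

    v = _mm256_insertf128_ps(v, hi, 1);        /* put the upper half back */
    _mm256_storeu_ps(vec, v);
}
```

For a slot in the range 0..3, the same pattern would apply to the lower half (selector 0 instead of 1 in the extract and insert steps).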
_mm_load1_ps is a composite intrinsic for movss + shuffle to broadcast a float. If you were already willing to let the compiler do whatever it felt like to get a constant into a register, _mm256_set1_ps(*f) is a good choice. Smart compilers will emit VBROADCASTSS where appropriate. - Peter Cordes
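As a minimal sketch of the broadcast suggested above: `_mm256_set1_ps(*f)` leaves the instruction choice to the compiler, while `_mm256_broadcast_ss(f)` is the AVX intrinsic that corresponds directly to VBROADCASTSS with a memory operand. The helper names `broadcast8_set1` and `broadcast8_bcast` are hypothetical, and the `target("avx")` attribute assumes GCC or Clang so the file builds without `-mavx`.

```c
#include <immintrin.h>

/* Hypothetical helper: let the compiler pick how to splat *src to 8 lanes. */
__attribute__((target("avx")))
void broadcast8_set1(const float *src, float out[8])
{
    __m256 v = _mm256_set1_ps(*src);
    _mm256_storeu_ps(out, v);
}

/* Hypothetical helper: request VBROADCASTSS from memory explicitly. */
__attribute__((target("avx")))
void broadcast8_bcast(const float *src, float out[8])
{
    __m256 v = _mm256_broadcast_ss(src);
    _mm256_storeu_ps(out, v);
}
```

Both helpers should produce the same eight values; which one generates better code depends on the compiler and the surrounding code.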

1 Answer

11
votes