The task I'm facing is to shuffle one __m128 vector and store the result in another one.
The way I see it, there are two basic ways to shuffle a packed floating-point __m128 vector:
- _mm_shuffle_ps, which uses the SHUFPS instruction. It is not necessarily the best option if you want the values from one vector only: it takes two values from the destination operand, which implies an extra move.
- _mm_shuffle_epi32, which uses the PSHUFD instruction. It seems to do exactly what is expected here and can have better latency/throughput than SHUFPS.
The latter intrinsic, however, works with integer vectors (__m128i), and there seems to be no floating-point counterpart, so using it with __m128 would require some ugly explicit casting. Also, the fact that there is no such counterpart probably means there is a proper reason for that, which I am not aware of.
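For reference, here is a minimal sketch of the two variants described above, assuming SSE2 is available and using a fixed lane-reversal mask. The function names are hypothetical and chosen only for illustration; the intrinsics and the _MM_SHUFFLE macro are the standard ones from the Intel headers.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _MM_SHUFFLE */
#include <emmintrin.h>  /* SSE2: __m128i, _mm_shuffle_epi32, cast intrinsics */

/* Variant 1 (hypothetical name): reverse the four lanes with SHUFPS,
 * passing the same vector as both source operands. */
static inline __m128 reverse_with_shuffle_ps(__m128 x)
{
    return _mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Variant 2 (hypothetical name): the same reversal with PSHUFD.
 * The casts only reinterpret the bits; they do not convert values. */
static inline __m128 reverse_with_shuffle_epi32(__m128 x)
{
    __m128i xi = _mm_castps_si128(x);
    __m128i yi = _mm_shuffle_epi32(xi, _MM_SHUFFLE(0, 1, 2, 3));
    return _mm_castsi128_ps(yi);
}
```

Both functions return the same lanes for the same input; which instruction the compiler actually emits for either one depends on the compiler and target flags.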
The questions are:
Why is there no intrinsic to shuffle one floating-point vector and store the result in another?
If _mm_shuffle_ps(x, x, ...) can generate PSHUFD, can it be guaranteed?
If PSHUFD should not be used for floating-point values, what is the reason for that?
Thank you!
_mm_shuffle_pd does exist – harold
[...] __m128 y = _mm_shuffle_ps(x, x, shuf_mask);? Shuffles are very fast; there's no performance gain to be made by them only taking one input. If the look of the code bothers you, then you can write an inline wrapper function or macro. AVX introduced _mm_permute_ps(), which takes one input as you're looking for. – Jason R
[...] PSHUFD instruction from a _mm_shuffle_ps() call. Can you provide an example? Also, according to Intel's intrinsics guide, the two instructions have the same throughput and latency on all recent architectures (barring any bypass delays from moving between FP and integer domains). – Jason R
[...] PSHUFD? You haven't cited any verifiable reason for why you believe it's better. It's actually likely to be slower due to domain crossing in the SIMD unit. – Jason R
[...] PSHUFD is faster than SHUFPS? As for the reference you asked for, the definitive one is Agner Fog's online resources. See pp. 112 & 129 on agner.org/optimize/microarchitecture.pdf. – Cody Gray
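The wrapper suggested in the comments could be sketched roughly as follows. The macro and function names are hypothetical. A macro is used because the shuffle mask must be a compile-time constant, and note that it evaluates its first argument twice. The AVX variant is guarded so the block still compiles when AVX is not enabled.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _MM_SHUFFLE */

/* Hypothetical helper: single-input float shuffle built on SHUFPS.
 * Caution: (x) is evaluated twice, so avoid arguments with side effects. */
#define SHUFFLE1_PS(x, mask) _mm_shuffle_ps((x), (x), (mask))

/* Example use (hypothetical name): swap the lanes within each pair,
 * so (a, b, c, d) becomes (b, a, d, c). */
static inline __m128 swap_pairs_sse(__m128 x)
{
    return SHUFFLE1_PS(x, _MM_SHUFFLE(2, 3, 0, 1));
}

#ifdef __AVX__
#include <immintrin.h>  /* AVX: _mm_permute_ps */

/* Same permutation with the one-input AVX intrinsic mentioned above. */
static inline __m128 swap_pairs_avx(__m128 x)
{
    return _mm_permute_ps(x, _MM_SHUFFLE(2, 3, 0, 1));
}
#endif
```

With either form, the call site reads as a one-input shuffle, which addresses the cosmetic concern without changing the intrinsic underneath.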