Latency isn't always the worst thing ever. If the shuffle isn't part of a loop-carried dep-chain, an extra cycle or two of bypass latency usually doesn't matter, so just use the single instruction.
Also, there might not be any! Agner Fog's microarch doc says he found no extra latency in some cases when using the "wrong" type of shuffle or boolean, on Sandybridge. Blends still have the extra latency. On Haswell, he says there are no extra delays at all for mixing types of shuffle. (pg 140, Data Bypass Delays.)
So go ahead and use shufps, unless you care a lot about your code being fast on Nehalem. (Previous designs (Merom/Conroe, and Penryn) didn't have extra bypass delays for using the wrong move or shuffle.)
For AMD, shufps runs in the ivec domain, same as integer shuffles, so it's fine to use it. As on Intel, FP blends run in the FP domain, and thus have no bypass delay for FP data.
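As a minimal sketch of what "just use shufps" looks like from C intrinsics on integer data: the example below assumes the goal is to put the high 64 bits of one vector in the low half of the result and the low 64 bits of another vector in the high half. The helper name hi_a_lo_b and that exact layout are assumptions for illustration. The cast intrinsics generate no instructions; they only change the type the compiler sees.

```c
#include <emmintrin.h>  /* SSE2; also pulls in xmmintrin.h for _mm_shuffle_ps */
#include <stdint.h>

/* Hypothetical helper (name and layout are assumptions):
 *   result[63:0]   = a[127:64]
 *   result[127:64] = b[63:0]
 * Done with a single shufps on integer data. */
static inline __m128i hi_a_lo_b(__m128i a, __m128i b)
{
    __m128 fa = _mm_castsi128_ps(a);  /* reinterpret only, no instruction */
    __m128 fb = _mm_castsi128_ps(b);
    /* The low two dwords come from the first operand (elements 2 and 3),
     * the high two dwords from the second operand (elements 0 and 1). */
    __m128 r = _mm_shuffle_ps(fa, fb, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_castps_si128(r);
}
```

Whether this costs any bypass latency depends on the microarchitecture, as described above.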
If you include multiple asm versions depending on which instruction sets are supported, without going completely nuts about having the optimal version of everything for every CPU like x264 does, you might use wrong-type ops in your version for AVX CPUs, but use multiple instructions in your non-AVX version. Nehalem has large penalties (2 cycle bypass delays for each domain transition), while Sandybridge is 0 or 1 cycle. SnB is the first generation with AVX.
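A rough sketch of that split, using a compile-time check rather than separate asm files: the function name combine_halves and the operation (high half of a into the low half of the result, low half of b into the high half) are assumptions for illustration. The AVX build uses one wrong-type shuffle; the non-AVX build stays in the integer domain with a byte shift plus an unpack.

```c
#include <emmintrin.h>  /* SSE2; also pulls in xmmintrin.h for _mm_shuffle_ps */
#include <stdint.h>

/* Hypothetical helper (name and layout are assumptions):
 *   result[63:0]   = a[127:64]
 *   result[127:64] = b[63:0] */
static inline __m128i combine_halves(__m128i a, __m128i b)
{
#ifdef __AVX__
    /* AVX-capable CPUs (SnB and later): the bypass delay is 0 or 1 cycle,
     * so use a single FP shuffle on integer data. */
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(1, 0, 3, 2)));
#else
    /* Non-AVX build (may run on Nehalem): stay in the integer domain.
     * psrldq moves a's high half down; punpcklqdq appends b's low half. */
    return _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
#endif
}
```

Both paths produce the same result; only the instruction count and execution domain differ. A compiler may still pick different instructions than the intrinsics suggest, so check the generated asm if this matters.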
Pre-Nehalem (no SSE4.2) is so old that it's probably not worth tuning a version specifically for it, even though it doesn't have any penalties for "wrong type" shuffles. Nehalem is right on the cusp of being kinda slow, so software running on those systems will have the hardest time operating in real-time, or not feeling slow. Thus, being bad on Nehalem would add to a bad user experience since their system is already not the fastest.
_mm_blend_epi16(a,b,7) from SSE 4.1 would give you the upper part of a in the upper part of dest and the lower part of b in the lower part of dest but then you would still have to swap. – Z boson
_mm_shuffle_epi32(dst,0x4e) that could be faster than a shift and unpack (especially if blend is a 3 register instruction?) but I'd really prefer an ssse3 or earlier solution – Steve Cox