Best way to shuffle 64-bit portions of two __m128i's

Question

I have two __m128is, a and b, that I want to shuffle so that the upper 64 bits of a fall in the lower 64 bits of dst and the lower 64 bits of b fall in the upper 64 of dst. i.e.

dst[ 0:63]  = a[64:127]
dst[64:127] = b[0:63]

Equivalent to:

__m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);

or

__m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));

Is there a better way to do this than the first method? The second is a single instruction, but the switch to the floating-point SIMD execution domain is more costly than the extra instruction in the first.
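As a sanity check that the two formulations above really compute the same thing, here is a minimal sketch that uses only SSE2 intrinsics. The function names (hi_lo_via_unpack, hi_lo_via_shufpd, hi_lo_check_equivalent) and the test values are made up for illustration and are not part of the question.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Method 1: shift a right by 8 bytes, then interleave the low 64-bit halves. */
static __m128i hi_lo_via_unpack(__m128i a, __m128i b) {
    return _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
}

/* Method 2: one shufpd; imm bit 0 = 1 picks a's high half, bit 1 = 0 picks b's low half. */
static __m128i hi_lo_via_shufpd(__m128i a, __m128i b) {
    return _mm_castpd_si128(
        _mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1));
}

/* Hypothetical self-check: both methods must yield { a[64:127], b[0:63] }. */
static void hi_lo_check_equivalent(void) {
    const uint64_t av[2] = {0x1111111111111111ULL, 0x2222222222222222ULL};
    const uint64_t bv[2] = {0x3333333333333333ULL, 0x4444444444444444ULL};
    uint64_t r1[2], r2[2];
    __m128i a = _mm_loadu_si128((const __m128i *)av);
    __m128i b = _mm_loadu_si128((const __m128i *)bv);
    _mm_storeu_si128((__m128i *)r1, hi_lo_via_unpack(a, b));
    _mm_storeu_si128((__m128i *)r2, hi_lo_via_shufpd(a, b));
    assert(r1[0] == av[1] && r1[1] == bv[0]);
    assert(r2[0] == r1[0] && r2[1] == r1[1]);
}
```

The casts in the second method generate no instructions; they only change the type the compiler sees.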

I don't know of a better way yet. _mm_blend_epi16(a,b,0x0F) from SSE4.1 would give you the upper part of a in the upper part of dst and the lower part of b in the lower part of dst, but then you would still have to swap the halves. — Z boson
You can just swap with another _mm_shuffle_epi32(dst,0x4e), which could be faster than a shift and unpack (especially if blend is a 3-register instruction?), but I'd really prefer an SSSE3 or earlier solution. — Steve Cox
I don't know of a better method from ssse3 or lower. You can look for them at software.intel.com/sites/landingpage/IntrinsicsGuide — Z boson
@SteveCox, _mm_shuffle_epi32 is available in SSE2. I've been using it on machines that don't support > SSE2 and it works fine. You definitely don't want to cross over from the integer instructions to the doubles. It is likely to introduce extra latency according to the Intel docs. — Marty

Peter Cordes · Accepted Answer · 2015-07-06T02:03:29

Latency isn't always the worst thing ever. If it's not part of a loop-carried dep-chain, then just use the single instruction.

Also, there might not be any! Agner Fog's microarch doc says he found no extra latency in some cases when using the "wrong" type of shuffle or boolean, on Sandybridge. Blends still have the extra latency. On Haswell, he says there are no extra delays at all for mixing types of shuffle. (pg 140, Data Bypass Delays.)

So go ahead and use shufps, unless you care a lot about your code being fast on Nehalem. (Earlier designs, Merom/Conroe and Penryn, didn't have extra bypass delays for using the wrong move or shuffle.)
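For reference, the shufps form of the asker's shuffle could look like the sketch below. The function names hi_lo_shufps and hi_lo_shufps_check are hypothetical; the immediate _MM_SHUFFLE(1, 0, 3, 2) (0x4E) takes 32-bit elements 2 and 3 from a for the low half and elements 0 and 1 from b for the high half.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128i and the si128 <-> ps casts */
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_shuffle_ps, _MM_SHUFFLE */

/* dst[0:63] = a[64:127], dst[64:127] = b[0:63], done with a single shufps. */
static __m128i hi_lo_shufps(__m128i a, __m128i b) {
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(1, 0, 3, 2)));
}

/* Hypothetical self-check with arbitrary values. */
static void hi_lo_shufps_check(void) {
    const uint64_t av[2] = {0xA0A0A0A0A0A0A0A0ULL, 0xA1A1A1A1A1A1A1A1ULL};
    const uint64_t bv[2] = {0xB0B0B0B0B0B0B0B0ULL, 0xB1B1B1B1B1B1B1B1ULL};
    uint64_t out[2];
    __m128i a = _mm_loadu_si128((const __m128i *)av);
    __m128i b = _mm_loadu_si128((const __m128i *)bv);
    _mm_storeu_si128((__m128i *)out, hi_lo_shufps(a, b));
    assert(out[0] == av[1] && out[1] == bv[0]);
}
```

shufps produces the same result as the shufpd from the question here, and its encoding is one byte shorter because it needs no 0x66 prefix.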

For AMD, shufps runs in the ivec domain, same as integer shuffles, so it's fine to use it. Like Intel, FP blends run in the FP domain, and thus have no bypass delay for FP data.

If you include multiple asm versions depending on which instruction sets are supported, without going completely nuts about having the optimal version of everything for every CPU like x264 does, you might use wrong-type ops in your version for AVX CPUs, but use multiple instructions in your non-AVX version. Nehalem has large penalties (2-cycle bypass delays for each domain transition), while Sandybridge has 0 or 1 cycle. SnB is the first generation with AVX, so an AVX-only version never runs on Nehalem.
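One way to sketch that split in C is to choose at compile time, assuming each version of the code is built separately with its own flags (for example with and without -mavx). The name hi_lo_select is hypothetical, and a real project might prefer runtime dispatch; __AVX__ is the macro GCC, Clang and MSVC define when AVX code generation is enabled.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */
#include <stdint.h>

static inline __m128i hi_lo_select(__m128i a, __m128i b) {
#if defined(__AVX__)
    /* AVX build: only runs on Sandybridge or later, so use one FP shuffle. */
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(1, 0, 3, 2)));
#else
    /* Non-AVX build: may run on Nehalem, so stay in the integer domain. */
    return _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b);
#endif
}

/* Hypothetical self-check: the result must not depend on the branch taken. */
static void hi_lo_select_check(void) {
    const uint64_t av[2] = {10ULL, 20ULL};
    const uint64_t bv[2] = {30ULL, 40ULL};
    uint64_t out[2];
    __m128i a = _mm_loadu_si128((const __m128i *)av);
    __m128i b = _mm_loadu_si128((const __m128i *)bv);
    _mm_storeu_si128((__m128i *)out, hi_lo_select(a, b));
    assert(out[0] == 20ULL && out[1] == 30ULL);
}
```

Both branches use only SSE/SSE2 intrinsics, so the file compiles with or without AVX enabled; the macro only decides which instruction sequence is emitted.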

Pre-Nehalem (no SSE4.2) is so old that it's probably not worth tuning a version specifically for it, even though it doesn't have any penalties for "wrong type" shuffles. Nehalem is right on the cusp of being kinda slow, so software running on those systems will have the hardest time operating in real-time, or not feeling slow. Thus, being bad on Nehalem would add to a bad user experience since their system is already not the fastest.
