
I'm trying to understand the possible bypass delays when switching between execution-unit domains.

For example, the following two lines of code give exactly the same result.

_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
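As a quick sanity check of the equivalence claim, here is a minimal sketch that compares the two expressions bit for bit. The function names (add_shifted_int, add_shifted_float, variants_match) are made up for illustration, and the code assumes an x86 target with SSE2. Both variants should compute (x0, x1, x2 + x0, x3 + x1).

```c
#include <emmintrin.h>  /* SSE2: _mm_slli_si128 and the cast intrinsics */
#include <string.h>     /* memcmp */

/* Variant 1: shift left by 8 bytes in the integer domain, then add as floats. */
__m128 add_shifted_int(__m128 x) {
    return _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
}

/* Variant 2: build (0, 0, x0, x1) with a float shuffle, then add. */
__m128 add_shifted_float(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Returns 1 if both variants give bit-identical results for (a, b, c, d). */
int variants_match(float a, float b, float c, float d) {
    __m128 x = _mm_setr_ps(a, b, c, d);  /* a is element 0, d is element 3 */
    float r1[4], r2[4];
    _mm_storeu_ps(r1, add_shifted_int(x));
    _mm_storeu_ps(r2, add_shifted_float(x));
    return memcmp(r1, r2, sizeof r1) == 0;
}
```

This only confirms that the results agree; it says nothing about which variant is faster.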

Which line of code is better to use?

The assembly output for the first line gives:

vpslldq xmm1, xmm0, 8
vaddps  xmm0, xmm1, xmm0

The assembly output for the second line gives:

vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64   ; 00000040H
vaddps  xmm2, xmm1, XMMWORD PTR [rcx]

Now if I look at Agner Fog's microarchitecture manual, he gives an example on page 112 of using an integer shuffle (pshufd) on float values versus using a float shuffle (shufps) on float values. Switching domains adds 4 extra clock cycles so the solution using shufps is better.

The first line of code I listed using _mm_slli_si128 has to switch domains between integer and float vectors. The second using _mm_shuffle_ps stays in the same domain. Doesn't this imply that the second line of code is the better solution?

Have you tried benchmarking this? - Leeor
No, not yet. But I have some code to do it with. If you want to see why I'm interested see the answer here and the prefix_sum_SSE function. - Z boson

Section 2.1.4 in the Intel optimization guide indicates that you (and Agner) are quite right on this matter:

When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operation.

So in general it seems you'd be better off keeping within the same stack/domain as much as possible.

Of course benchmarking is always preferred, and all this is worth handling only if it is indeed a bottleneck in your code.
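
For what it's worth, a crude latency benchmark could look something like the sketch below. The function names (time_int_shift, time_float_shuffle) are hypothetical, and it assumes an x86 target with SSE2 and that clock() has enough resolution for the chosen iteration count. Each loop iteration depends on the previous one, so the bypass delay (if any) should show up in the total time rather than being hidden by out-of-order execution.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE float ones) */
#include <time.h>       /* clock, CLOCKS_PER_SEC */

/* Runs `iters` dependent steps of the integer-shift variant.
   io[] supplies the starting vector and receives the final one, which keeps
   the loop between the two clock() calls and stops it being optimized away.
   Returns the elapsed CPU time in seconds. */
double time_int_shift(long iters, float io[4]) {
    clock_t t0 = clock();
    __m128 x = _mm_loadu_ps(io);
    for (long i = 0; i < iters; i++) {
        /* loop-carried dependency: the next x needs the current x */
        x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    }
    _mm_storeu_ps(io, x);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Same chain, but using the float shuffle so everything stays in one domain. */
double time_float_shuffle(long iters, float io[4]) {
    clock_t t0 = clock();
    __m128 x = _mm_loadu_ps(io);
    for (long i = 0; i < iters; i++) {
        x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
    }
    _mm_storeu_ps(io, x);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Call each function from main with the same starting vector (for example all ones) and a large iteration count, then compare the two times. Check the generated assembly first: some compilers rewrite shuffle intrinsics, so both loops may end up using the same instruction, in which case the timings tell you nothing about the domain crossing.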