I think you have an endian issue with how you're printing your input and output.
The left-most bytes within each 64-bit half are the least-significant bytes in your actual output, so 0xfe << 4
becomes 0xe0
, with the f
shifting into a higher byte.
See Convention for displaying vector registers for more discussion of that.
Your "expected" output matches what you'd get if you were printing values high element first (highest address when stored). But that's not what you're doing; you're printing each byte separately in ascending memory order. x86 is little-endian. This conflicts with the numeral system we use in English, where we read Arabic numerals from left to right, highest place-value on the left, effectively human big-endian. Fun fact: The Arabic language reads from right to left so for them, written numbers are "human little-endian".
(And across elements, higher elements are at higher addresses; printing high elements first makes whole-vector shifts like _mm_bslli_si128
aka pslldq
make sense in the way it shifts bytes left between elements.)
If you're using a debugger, you're probably printing within that. If you're using debug-prints, see print a __m128i variable.
BTW, you can use _mm_set1_epi64x(4)
to put the same value in both elements of a vector, instead of using separate l
and r
variables with the same value.
In _mm_set
intrinsics, the high elements come first, matching the diagrams in Intel's asm manuals, and matching the semantic meaning of "left" shift moving bits/bytes to the left. (e.g. see Intel's diagrams an element-numbering for pshufd, _mm_shuffle_epi32
)
BTW, AVX512 has vprolvq
rotates. But yes, to emulate rotates you want a SIMD version of (x << n) | x >> (64-n)
. Note that x86 SIMD shifts saturate the shift count, unlike scalar shifts which mask the count. So x >> 64
will shift out all the bits. If you want to support rotate counts above 63, you probably need to mask.
(Best practices for circular shift (rotate) operations in C++ but you're using intrinsics so you don't have to worry about C shift-count UB, just the actual known hardware behaviour.)
sllv_epi64
shifts within 64-bit elements only, not of full 128-bit numbers. It's also a shift, not a rotate, shifting in zeros instead of shifting in the high bits that get shifted out. AVX512 has rotates (felixcloutier.com/x86/vprold:vprolvd:vprolq:vprolvq), otherwise you need to emulate with shiftleft|right
. But from your desired output, you don't actually want a rotate. – Peter Cordes