3
votes

Looking through the Intel intrinsics guide, I saw this instruction. Going by the naming pattern, the meaning should be clear: "shift a 128-bit register left by a fixed number of bits", but it is not. In actuality it shifts by a fixed number of bytes, which makes it exactly the same as _mm_bslli_si128.

  • Is this an oversight? Shouldn't it be shifting by bits like _mm_slli_epi32 or _mm_slli_epi64?
  • If not, in which situation should I use this over _mm_bslli_si128?
  • Is there an assembly instruction which does this correctly?
  • What is the best way of emulating this with smaller shifts?
2
My comparison of older and newer documentation suggests that the instruction (V)PSLLDQ, which shifts byte-wise, was first exposed via an inconsistently named intrinsic (using "slli", incorrectly suggesting a bit shift), while the consistently named intrinsic (using "bslli", correctly suggesting a byte shift) wasn't added until much later, at which point it was not possible to remove the old intrinsic without breaking existing code. For new code, use of the "bslli" variant therefore seems preferable as the more appropriately named intrinsic. – njuffa
I sort of suspected it to be a historical artifact, but your comment confirms that. – lennartVH01

2 Answers

5
votes

1. That's not an oversight. The instruction indeed shifts by bytes, i.e. by multiples of 8 bits.

2. It doesn't matter: _mm_slli_si128 and _mm_bslli_si128 are equivalent; both compile into the pslldq SSE2 instruction.

As for the emulation, I'd do it like this, assuming you have C++17. If you're writing C++14, replace if constexpr with a normal if, and add a message to the static_assert.

template<int i>
inline __m128i shiftLeftBits( __m128i vec )
{
    static_assert( i >= 0 && i < 128 );
    // Handle a couple of trivial cases
    if constexpr( 0 == i )
        return vec;
    if constexpr( 0 == ( i % 8 ) )
        return _mm_slli_si128( vec, i / 8 );

    if constexpr( i > 64 )
    {
        // Shifting by more than 8 bytes, the lowest half will be all zeros
        vec = _mm_slli_si128( vec, 8 );
        return _mm_slli_epi64( vec, i - 64 );
    }
    else
    {
        // Shifting by less than 8 bytes.
        // Need to propagate a few bits across 64-bit lanes.
        __m128i low = _mm_slli_si128( vec, 8 );
        __m128i high = _mm_slli_epi64( vec, i );
        low = _mm_srli_epi64( low, 64 - i );
        return _mm_or_si128( low, high );
    }
}
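
A minimal usage sketch (the vector values and shift counts below are arbitrary, purely for illustration):

__m128i v = _mm_set_epi64x( 0x0123456789ABCDEF, 0x0F0F0F0F0F0F0F0F );
__m128i a = shiftLeftBits<24>( v );  // multiple of 8: a single pslldq
__m128i b = shiftLeftBits<13>( v );  // < 64, crosses the 64-bit lanes: pslldq + psllq/psrlq/por
__m128i c = shiftLeftBits<70>( v );  // > 64: pslldq then psllq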
4
votes

TL;DR: They're synonyms; the bslli name is newer, introduced around the same time as the new AVX-512 intrinsics (sometime before 2015, long after SSE2 _mm_slli_si128 was in widespread use). I find it clearer and would recommend it for new development.


SSE/AVX2/AVX-512 do not have bit-shifts with element sizes wider than 64. (Or any other bit-granularity operation like add, except pure-vertical bitwise boolean stuff that's really 128 fully separate operations, not one big wide one. Or for AVX-512 masking and broadcast-load purposes, can be in dword or qword chunks like _mm512_xor_epi32 / vpxord)

You have to emulate it somehow, which can be fairly efficient for compile-time-constant counts so you can pick between strategies according to c >= 64, with special cases for c%8 reducing to a byte-shift. Existing SO Q&As cover that, or see @Soonts' answer on this Q.

Runtime-variable counts would suck; you'd have to branch or do both ways and blend, unlike for element bit-shifts where _mm_sll_epi64(v, _mm_cvtsi32_si128(i)) can compile to movd / psllq xmm, xmm. Unfortunately, hardware variable-count versions of byte-shuffle/shift instructions don't exist, only for the bit-shift versions.
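
For what it's worth, here's a sketch of that branchy runtime-count approach (shiftLeftBitsVar is a made-up name; it mirrors the compile-time logic in @Soonts' answer, using the xmm-count forms _mm_sll_epi64 / _mm_srl_epi64):

inline __m128i shiftLeftBitsVar( __m128i vec, unsigned bits )
{
    // Whole-register left shift by a runtime bit count 0..127 (sketch only)
    if( bits == 0 )
        return vec;
    if( bits >= 64 )
    {
        // The low half becomes all zeros; move the old low qword into the high half first
        vec = _mm_slli_si128( vec, 8 );
        return _mm_sll_epi64( vec, _mm_cvtsi32_si128( (int)( bits - 64 ) ) );
    }
    // bits in 1..63: shift both halves, then OR in the bits that cross the 64-bit boundary
    __m128i high = _mm_sll_epi64( vec, _mm_cvtsi32_si128( (int)bits ) );
    __m128i carry = _mm_slli_si128( vec, 8 );
    carry = _mm_srl_epi64( carry, _mm_cvtsi32_si128( (int)( 64 - bits ) ) );
    return _mm_or_si128( high, carry );
}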


bslli / bsrli are new, clearer intrinsic names for the same asm instructions

The b names are supported in current versions of all 4 major compilers for x86 (Godbolt), and I'd recommend them for new development unless you need backwards compat with crusty old compilers, or for some reason you like the old name that doesn't bother to distinguish it from different operations. (e.g. familiarity; if you don't want people to have to look up this newfangled name in the manual.) A tiny equivalence sketch follows the compiler list below.

  • gcc since 4.8
  • clang since 3.7
  • ICC since ICC13 or earlier (Godbolt doesn't have any older)
  • MSVC since 19.14 or earlier (Godbolt doesn't have any older)
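
For example (purely illustrative; both functions should assemble to the same pslldq instruction):

#include <emmintrin.h>

__m128i oldName( __m128i v ) { return _mm_slli_si128( v, 5 ); }   // "slli" name, but shifts by 5 bytes
__m128i newName( __m128i v ) { return _mm_bslli_si128( v, 5 ); }  // same operation, clearer name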

If you check the intrinsics guide, _mm_slli_si128 is listed as an intrinsic for PSLLDQ, which is a byte shift. This is not a bug, just Intel's idea of a joke, or whatever process they used to choose names for intrinsics back in the SSE2 days. (There are only 2 hard problems in computer science: cache invalidation and naming things).

Asm mnemonics also use the same pattern of not making the byte-shuffle one look different from the bit-shifts: psllw xmm, 1 / pslld / psllq / pslldq. Again, you just have to know that the 128-bit size is special, and must be a byte shuffle rather than a bit-shift, because x86 simply doesn't have 128-bit bit-shifts. (Or you have to check the manual.)

The asm manual entry for pslldq in turn lists intrinsics for forms of it, interestingly only using the b name for the __m512i AVX-512BW version. When SSE2 and AVX2 were new, _mm_slli_si128 and _mm256_slli_si256 were the only names available, I think. The bslli name certainly post-dates the SSE2 intrinsics.

(Note that the si256 and si512 versions are just 2 or 4 copies of the 16-byte operation, not shifting bytes across 128-bit lanes; something a few other Q&As have asked for. This often makes AVX2 versions of shuffles like this and palignr a lot less useful than they'd otherwise be: either not worth using at all, or needing extra shuffles on top of it.)
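
A small sketch of that in-lane behavior, assuming AVX2 (the byte values are arbitrary):

#include <immintrin.h>
#include <stdio.h>

int main()
{
    // Bytes 0..15 form the low 128-bit lane, bytes 16..31 the high lane
    __m256i v = _mm256_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 );
    __m256i r = _mm256_slli_si256( v, 1 );   // a.k.a. _mm256_bslli_epi128: shift each lane left by 1 byte
    unsigned char out[32];
    _mm256_storeu_si256( (__m256i*)out, r );
    // out[16] is 0 (zero-filled at the lane boundary), not 15: bytes never cross the 128-bit lanes
    printf( "%d %d %d\n", out[1], out[16], out[17] );   // prints "0 0 16"
    return 0;
}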

I think this new bslli name was introduced when AVX-512 was new. Intel invented some new names for other intrinsics around that time, and the AVX-512 load/store intrinsics take void* instead of __m512i*, which is a major improvement to the amount of noise in code, especially for C where implicit conversion to void* is allowed. (Creating a misaligned __m512i* is not actually a problem in C terms, but you couldn't deref it normally, so it's a weird-looking thing to do.) So there was cleanup work happening on intrinsic naming then, and I think this was part of it.

(AVX-512 also gave Intel the chance to introduce some fairly bad names, like _mm_loadu_epi32(const void*) - you'd guess that's a strict-aliasing-safe way to do a 32-bit movd load, right? No, unfortunately, it's an intrinsic for vmovdqu32 xmm, [mem] with no masking. It's just _mm_loadu_si128 with a different C type for the pointer arg. It's there for consistency with the naming pattern of _mm_maskz_loadu_epi32. It would be nice to have void* load / store intrinsics for __m128i and __m256i, but if they have misleading names like that (esp. when you aren't using the mask/maskz versions in nearby code), I'll just stick to those cumbersome _mm256_loadu_si256( (const __m256i*)(arr + i) ) casts for the old intrinsic, because I love typing 256 three times. >.<)

I wish asm were more maintainable (or that intrinsics just used asm mnemonics) because asm is much more concise; Intel generally does a good job naming their mnemonics.


It helps somewhat, though not entirely, to know the difference between epi16/32/64 and si128: EPI = Extended (SSE instead of MMX) Packed Integer, with "packed" implying multiple SIMD elements. si128 means a whole 128-bit integer vector.

There's no way to infer from the name that you aren't just doing the same thing to a single 128-bit integer, instead of packed elements. You just have to know that there are no bit-granularity things that ever cross 64-bit boundaries, only SIMD shuffles (which work in terms of bytes). This avoids the combinatorial explosion of building a really wide barrel shifter, or of carry propagation at such a long distance for a 128-bit add, or whatever.
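
For instance, a trivial contrast of the suffixes (namingDemo is just an illustrative name):

#include <emmintrin.h>

void namingDemo( __m128i v, __m128i out[3] )
{
    out[0] = _mm_slli_epi32( v, 4 );   // epi32: each packed 32-bit element shifted left by 4 bits
    out[1] = _mm_slli_epi64( v, 4 );   // epi64: each packed 64-bit element shifted left by 4 bits
    out[2] = _mm_slli_si128( v, 4 );   // si128: the whole vector shifted left by 4 *bytes*, not bits
}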