TL:DR: They're synonyms; the `bslli` name is newer, introduced around the same time as new AVX-512 intrinsics (sometime before 2015, long after SSE2 `_mm_slli_si128` was in widespread usage). I find it clearer and would recommend it for new development.
SSE/AVX2/AVX-512 do not have bit-shifts with element sizes wider than 64 bits. (Nor any other bit-granularity operation like add, except pure-vertical bitwise booleans, which are really 128 fully separate operations rather than one big wide one. For AVX-512 masking and broadcast-load purposes, those can be done in dword or qword chunks, like `_mm512_xor_epi32` / `vpxord`.)
You have to emulate it somehow, which can be fairly efficient for compile-time-constant counts: you can pick between strategies according to `c >= 64`, with special cases like `c % 8 == 0` reducing to a byte-shift. Existing SO Q&As cover that, or see @Soonts' answer on this question.
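As a minimal sketch of that strategy-picking (the helper name and structure here are my own, assuming C++17 `if constexpr`, not from any library):

```c++
#include <immintrin.h>

// Hypothetical helper: shift a whole __m128i left by C *bits*, C a compile-time constant.
template <int C>
__m128i slli_si128_bits(__m128i v) {
    static_assert(C >= 0 && C < 128, "shift count out of range");
    if constexpr (C == 0) {
        return v;
    } else if constexpr (C % 8 == 0) {
        return _mm_bslli_si128(v, C / 8);        // whole bytes: a single pslldq
    } else if constexpr (C >= 64) {
        // Low qword becomes zero; the old low qword supplies all the high bits.
        return _mm_slli_epi64(_mm_bslli_si128(v, 8), C - 64);
    } else {
        // Shift both qwords, then OR in the bits that cross the 64-bit boundary.
        __m128i shifted = _mm_slli_epi64(v, C);
        __m128i carry   = _mm_srli_epi64(_mm_bslli_si128(v, 8), 64 - C);
        return _mm_or_si128(shifted, carry);
    }
}
```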
Runtime-variable counts would suck; you'd have to branch, or do it both ways and blend. That's unlike element bit-shifts, where `_mm_sll_epi64(v, _mm_cvtsi32_si128(i))` can compile to `movd` / `psllq xmm, xmm`. Unfortunately, hardware variable-count versions of byte-shuffle/shift instructions don't exist; only the bit-shifts have them.
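For the per-element case that is cheap, a tiny wrapper (the function name is made up for illustration) looks like:

```c++
#include <immintrin.h>

// Runtime count is fine for 64-bit element shifts: compiles to movd + psllq xmm,xmm.
__m128i sll_each_qword(__m128i v, int count) {
    return _mm_sll_epi64(v, _mm_cvtsi32_si128(count));
}
```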
## `bslli` / `bsrli` are new, clearer intrinsic names for the same asm instructions
The `b` names are supported in current versions of all 4 major compilers for x86 (Godbolt), and I'd recommend them for new development unless you need backwards compatibility with crusty old compilers, or for some reason you like the old name that doesn't bother to distinguish this from different operations (e.g. familiarity: if you don't want people to have to look up this newfangled name in the manual).
- gcc since 4.8
- clang since 3.7
- ICC since ICC13 or earlier (Godbolt doesn't have any older)
- MSVC since 19.14 or earlier (Godbolt doesn't have any older)
If you check the intrinsics guide, `_mm_slli_si128` is listed as an intrinsic for `PSLLDQ`, which is a byte shift. This is not a bug, just Intel's idea of a joke, or whatever process they used to choose names for intrinsics back in the SSE2 days. (There are only 2 hard problems in computer science: cache invalidation and naming things.)
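To make the synonym concrete (function names are just for the example), both of these compile to the same `pslldq xmm, 4`:

```c++
#include <immintrin.h>

__m128i old_name(__m128i v) { return _mm_slli_si128(v, 4);  }  // "slli": reads like a bit-shift
__m128i new_name(__m128i v) { return _mm_bslli_si128(v, 4); }  // "bslli": b for byte, which is what it does
```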
Asm mnemonics also use the same pattern of not making the byte-shuffle one look different from the bit-shifts: `psllw xmm, 1` / `pslld` / `psllq` / `pslldq`. Again, you just have to know that the 128-bit size is special, and must be a byte shuffle rather than a bit-shift, because x86 never has 128-bit bit-shifts. (Or you have to check the manual.)
The asm manual entry for `pslldq` in turn lists intrinsics for forms of it, interestingly only using the `b` name for the `__m512i` AVX-512BW version. When SSE2 and AVX2 were new, `_mm_slli_si128` and `_mm256_slli_si256` were the only names available, I think. Certainly the `bslli` name post-dates the SSE2 intrinsics.
(Note that the `si256` and `si512` versions are just 2 or 4 copies of the 16-byte operation, not shifting bytes across 128-bit lanes, something a few other Q&As have asked for. This often makes AVX2 versions of shuffles like this and `palignr` a lot less useful than they'd otherwise be: either not worth using at all, or needing extra shuffles on top.)
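A small demo of that in-lane behavior, assuming AVX2 (note that byte 15 never moves into byte 16):

```c++
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i v = _mm256_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
                                 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31);
    __m256i s = _mm256_bslli_epi128(v, 1);   // vpslldq ymm: each 128-bit lane shifts independently
    unsigned char out[32];
    _mm256_storeu_si256((__m256i*)out, s);
    for (int i = 0; i < 32; i++) printf("%d ", out[i]);
    // prints: 0 0 1 ... 14   0 16 17 ... 30   (byte 15 falls off the end of the low lane)
    printf("\n");
}
```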
I think this new `bslli` name was introduced when AVX-512 was new. Intel invented some new names for other intrinsics around that time, and the AVX-512 load/store intrinsics take `void*` instead of `__m512i*`, which is a major improvement to the amount of noise in code, especially for C, where implicit conversion between `void*` and other pointer types is allowed. (Creating a misaligned `__m512i*` is not actually a problem in C terms, but you couldn't deref it normally, so it's a weird-looking thing to do.) So there was cleanup work happening on intrinsic naming then, and I think this was part of it.
(AVX-512 also gave Intel the chance to introduce some fairly bad names, like `_mm_loadu_epi32(const void*)`. You'd guess that's a strict-aliasing-safe way to do a 32-bit `movd` load, right? No, unfortunately: it's an intrinsic for `vmovdqu32 xmm, [mem]` with no masking. It's just `_mm_loadu_si128` with a different C type for the pointer arg, there for consistency with the naming pattern of `_mm_maskz_loadu_epi32`. It would be nice to have `void*` load/store intrinsics for `__m128i` and `__m256i`, but with misleading names like that (especially when you aren't using the `mask`/`maskz` versions in nearby code), I'll just stick to those cumbersome `_mm256_loadu_si256( (const __m256i*)(arr + i) )` casts for the old intrinsic, because I love typing `256` three times. >.<
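To make the complaint concrete, here's the same unaligned load spelled both ways (the second needs AVX-512VL; the function name is mine):

```c++
#include <immintrin.h>
#include <stddef.h>

__m256i load_both_ways(const int* arr, size_t i) {
    __m256i a = _mm256_loadu_si256((const __m256i*)(arr + i));  // honest name, noisy cast
    __m256i b = _mm256_loadu_epi32(arr + i);                    // no cast, but the name suggests masking
    return _mm256_xor_si256(a, b);                              // always all-zero: identical loads
}
```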
I wish asm were more maintainable (or that intrinsics just used asm mnemonics), because asm is much more concise; Intel generally does a good job naming their mnemonics.
It helps somewhat, but not entirely, to know the difference between `epi16/32/64` and `si128`: EPI = Extended (SSE instead of MMX) Packed Integer ("packed" implying multiple SIMD elements), while `si128` means a whole 128-bit integer vector. There's no way to infer from the name alone that you aren't just doing the same thing to a single 128-bit integer instead of to packed elements. You just have to know that there are no bit-granularity operations that ever cross 64-bit boundaries, only SIMD shuffles (which work in terms of bytes). This avoids the combinatorial explosion of building a really wide barrel shifter, or of propagating carries across such a long distance for a 128-bit add, or whatever.
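For example, the suffix is your only clue, and for the 128-bit "shift" it points the wrong way:

```c++
#include <immintrin.h>

__m128i shift_elements(__m128i v) { return _mm_slli_epi32(v, 3); }  // pslld: 4 separate 32-bit shifts by 3 bits
__m128i shift_vector  (__m128i v) { return _mm_slli_si128(v, 3); }  // pslldq: whole vector moved by 3 BYTES
```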
> `(V)PSLLDQ`, which shifts byte-wise, was first exposed via an inconsistently named intrinsic (using "slli", incorrectly suggesting a bit shift), while the consistently named intrinsic (using "bslli", correctly suggesting a byte shift) wasn't added until much later, at which point it was not possible to remove the old intrinsic without breaking existing code. For new code, use of the "bslli" variant therefore seems preferable as the more appropriately named intrinsic. – njuffa