3
votes

Looking through the Intel intrinsics guide, I saw this instruction. Going by the naming pattern, the meaning should be clear: "shift a 128-bit register left by a fixed number of bits", but it is not. In actuality it shifts by a fixed number of bytes, which makes it exactly the same as _mm_bslli_si128.

  • Is this an oversight? Shouldn't it be shifting by bits like _mm_slli_epi32 or _mm_slli_epi64?
  • If not, in which situation should I use this over _mm_bslli_si128?
  • Is there an assembly instruction which does this correctly?
  • What is the best way of emulating this with smaller shifts?
2
My comparison of older and newer documentation suggests that the instruction (V)PSLLDQ, which shifts byte-wise, was first exposed via an inconsistently named intrinsic (using "slli", incorrectly suggesting a bit shift), while the consistently named intrinsic (using "bslli", correctly suggesting a byte shift) wasn't added until much later, at which point it was not possible to remove the old intrinsic without breaking existing code. For new code, use of the "bslli" variant therefore seems preferable as the more appropriately named intrinsic. – njuffa
I sort of suspected it to be a historical artifact, but your comment confirms that. – lennartVH01

2 Answers

5
votes

1. That's not an oversight. The instruction indeed shifts by bytes, i.e. by multiples of 8 bits.

2. It doesn't matter: _mm_slli_si128 and _mm_bslli_si128 are equivalent; both compile into the pslldq SSE2 instruction.

As for the emulation, I'd do it like this, assuming you have C++17. If you're writing C++14, replace if constexpr with a normal if, and add a message to the static_assert.

template<int i>
inline __m128i shiftLeftBits( __m128i vec )
{
    static_assert( i >= 0 && i < 128 );
    // Handle a couple of trivial cases
    if constexpr( 0 == i )
        return vec;
    if constexpr( 0 == ( i % 8 ) )
        return _mm_slli_si128( vec, i / 8 );

    if constexpr( i > 64 )
    {
        // Shifting by more than 8 bytes, the lowest half will be all zeros
        vec = _mm_slli_si128( vec, 8 );
        return _mm_slli_epi64( vec, i - 64 );
    }
    else
    {
        // Shifting by less than 8 bytes.
        // Need to propagate a few bits across 64-bit lanes.
        __m128i low = _mm_slli_si128( vec, 8 );
        __m128i high = _mm_slli_epi64( vec, i );
        low = _mm_srli_epi64( low, 64 - i );
        return _mm_or_si128( low, high );
    }
}
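
A minimal usage sketch (the vector values and shift counts below are arbitrary, purely for illustration):

__m128i v = _mm_set_epi64x( 0x0123456789ABCDEF, 0x0F0F0F0F0F0F0F0F );
__m128i a = shiftLeftBits<24>( v );  // multiple of 8: a single pslldq
__m128i b = shiftLeftBits<13>( v );  // < 64, crosses the 64-bit lanes: pslldq + psllq/psrlq/por
__m128i c = shiftLeftBits<70>( v );  // > 64: pslldq then psllq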
4
votes

TL;DR: They're synonyms; the bslli name is newer, introduced around the same time as the new AVX-512 intrinsics (sometime before 2015, long after SSE2 _mm_slli_si128 was in widespread use). I find it clearer and would recommend it for new development.


SSE/AVX2/AVX-512 do not have bit-shifts with element sizes wider than 64. (Or any other bit-granularity operation like add, except pure-vertical bitwise boolean stuff that's really 128 fully separate operations, not one big wide one. Or for AVX-512 masking and broadcast-load purposes, can be in dword or qword chunks like _mm512_xor_epi32 / vpxord)

You have to emulate it somehow, which can be fairly efficient for compile-time-constant counts so you can pick between strategies according to c >= 64, with special cases for c%8 reducing to a byte-shift. Existing SO Q&As cover that, or see @Soonts' answer on this Q.

Runtime-variable counts would suck; you'd have to branch or do both ways and blend, unlike for element bit-shifts where _mm_sll_epi64(v, _mm_cvtsi32_si128(i)) can compile to movd / psllq xmm, xmm. Unfortunately, hardware variable-count versions of byte-shuffle/shift instructions don't exist, only for the bit-shift versions.
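
For what it's worth, here's a sketch of that branchy runtime-count approach (shiftLeftBitsVar is a made-up name; it mirrors the compile-time logic in @Soonts' answer, using the xmm-count forms _mm_sll_epi64 / _mm_srl_epi64):

inline __m128i shiftLeftBitsVar( __m128i vec, unsigned bits )
{
    // Whole-register left shift by a runtime bit count 0..127 (sketch only)
    if( bits == 0 )
        return vec;
    if( bits >= 64 )
    {
        // The low half becomes all zeros; move the old low qword into the high half first
        vec = _mm_slli_si128( vec, 8 );
        return _mm_sll_epi64( vec, _mm_cvtsi32_si128( (int)( bits - 64 ) ) );
    }
    // bits in 1..63: shift both halves, then OR in the bits that cross the 64-bit boundary
    __m128i high = _mm_sll_epi64( vec, _mm_cvtsi32_si128( (int)bits ) );
    __m128i carry = _mm_slli_si128( vec, 8 );
    carry = _mm_srl_epi64( carry, _mm_cvtsi32_si128( (int)( 64 - bits ) ) );
    return _mm_or_si128( high, carry );
}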


bslli / bsrli are new, clearer intrinsic names for the same asm instructions

The b names are supported in current versions of all 4 major compilers for x86 (Godbolt), and I'd recommend them for new development unless you need backwards compat with crusty old compilers, or for some reason you like the old name that doesn't bother to distinguish it from different operations. (e.g. familiarity; if you don't want people to have to look up this newfangled name in the manual.) A tiny equivalence sketch follows the compiler list below.

  • gcc since 4.8
  • clang since 3.7
  • ICC since ICC13 or earlier (Godbolt doesn't have any older)
  • MSVC since 19.14 or earlier (Godbolt doesn't have any older)
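
For example (purely illustrative; both functions should assemble to the same pslldq instruction):

#include <emmintrin.h>

__m128i oldName( __m128i v ) { return _mm_slli_si128( v, 5 ); }   // "slli" name, but shifts by 5 bytes
__m128i newName( __m128i v ) { return _mm_bslli_si128( v, 5 ); }  // same operation, clearer name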

If you check the intrinsics guide, _mm_slli_si128 is listed as an intrinsic for PSLLDQ, which is a byte shift. This is not a bug, just Intel's idea of a joke, or whatever process they used to choose names for intrinsics back in the SSE2 days. (There are only 2 hard problems in computer science: cache invalidation and naming things).

Asm mnemonics also use the same pattern of not making the byte-shuffle one look different from the bit-shifts: psllw xmm, 1 / pslld / psllq / pslldq. Again, you just have to know that the 128-bit size is special, and must be a byte shuffle rather than a bit-shift, because x86 simply doesn't have 128-bit bit-shifts. (Or you have to check the manual.)

The asm manual entry for pslldq in turn lists intrinsics for forms of it, interestingly only using the b name for the __m512i AVX-512BW version. When SSE2 and AVX2 were new, _mm_slli_si128 and _mm256_slli_si256 were the only names available, I think. The bslli name certainly post-dates the SSE2 intrinsics.

(Note that the si256 and si512 versions are just 2 or 4 copies of the 16-byte operation, not shifting bytes across 128-bit lanes; something a few other Q&As have asked for. This often makes AVX2 versions of shuffles like this and palignr a lot less useful than they'd otherwise be: either not worth using at all, or needing extra shuffles on top of it.)
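
A small sketch of that in-lane behavior, assuming AVX2 (the byte values are arbitrary):

#include <immintrin.h>
#include <stdio.h>

int main()
{
    // Bytes 0..15 form the low 128-bit lane, bytes 16..31 the high lane
    __m256i v = _mm256_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 );
    __m256i r = _mm256_slli_si256( v, 1 );   // a.k.a. _mm256_bslli_epi128: shift each lane left by 1 byte
    unsigned char out[32];
    _mm256_storeu_si256( (__m256i*)out, r );
    // out[16] is 0 (zero-filled at the lane boundary), not 15: bytes never cross the 128-bit lanes
    printf( "%d %d %d\n", out[1], out[16], out[17] );   // prints "0 0 16"
    return 0;
}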

I think this new bslli name was introduced when AVX-512 was new. Intel invented some new names for other intrinsics around that time, and the AVX-512 load/store intrinsics take void* instead of __m512i*, which is a major improvement to the amount of noise in code, especially for C where implicit conversion to void* is allowed. (Creating a misaligned __m512i* is not actually a problem in C terms, but you couldn't deref it normally, so it's a weird-looking thing to do.) So there was cleanup work happening on intrinsic naming then, and I think this was part of it.

(AVX-512 also gave Intel the chance to introduce some fairly bad names, like _mm_loadu_epi32(const void*) - you'd guess that's a strict-aliasing-safe way to do a 32-bit movd load, right? No, unfortunately, it's an intrinsic for vmovdqu32 xmm, [mem] with no masking. It's just _mm_loadu_si128 with a different C type for the pointer arg. It's there for consistency with the naming pattern of _mm_maskz_loadu_epi32. It would be nice to have void* load / store intrinsics for __m128i and __m256i, but if they have misleading names like that (esp. when you aren't using the mask/maskz versions in nearby code), I'll just stick to those cumbersome _mm256_loadu_si256( (const __m256i*)(arr + i) ) casts for the old intrinsic, because I love typing 256 three times. >.<)

I wish asm were more maintainable (or that intrinsics just used asm mnemonics) because asm is much more concise; Intel generally does a good job naming their mnemonics.


It helps somewhat, though not entirely, to know the difference between epi16/32/64 and si128: EPI = Extended (SSE instead of MMX) Packed Integer, with "packed" implying multiple SIMD elements. si128 means a whole 128-bit integer vector.

There's no way to infer from the name that you aren't just doing the same thing to a single 128-bit integer, instead of packed elements. You just have to know that there are no bit-granularity things that ever cross 64-bit boundaries, only SIMD shuffles (which work in terms of bytes). This avoids the combinatorial explosion of building a really wide barrel shifter, or of carry propagation at such a long distance for a 128-bit add, or whatever.
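
For instance, a trivial contrast of the suffixes (namingDemo is just an illustrative name):

#include <emmintrin.h>

void namingDemo( __m128i v, __m128i out[3] )
{
    out[0] = _mm_slli_epi32( v, 4 );   // epi32: each packed 32-bit element shifted left by 4 bits
    out[1] = _mm_slli_epi64( v, 4 );   // epi64: each packed 64-bit element shifted left by 4 bits
    out[2] = _mm_slli_si128( v, 4 );   // si128: the whole vector shifted left by 4 *bytes*, not bits
}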