7
votes

In AVX there's only 128 bits PSHUFB

VPSHUFB xmm1, xmm2, xmm3/m128

and only AVX2 has the full PSHUFB for the whole 256 bits AVX registers

VPSHUFB ymm1, ymm2, ymm3/m256

How can this instruction be emulated efficiently with AVX intrinsics?

Also in this particular case, the source only has 8 elements (bytes) but those could be moved around within the full 32 bytes of the destination. So it's no problem for running just 2 x PSHUFB.

A problem I'm finding with VPSHUFB is it treats 16 (0x10) as 0, only 128 and up are filled with zero! (highest bit set) Is it possible to do it without adding compares and masking out?

1
Probably you didn't notice, but AVX2 VPSHUFB ymm, ymm, ymm/m256 is not really a 256-bit shuffle, instead it is 2x 128-bit shuffle.Marat Dukhan
Interesting. Thanks!alecco

1 Answers

10
votes

As @MaratDukhan has noticed, _mm256_shuffle_epi8 (i.e. VPSHUFB for ymm-s) does not perform full 32-byte shuffle. As for me, it is quite a pity...

That's why in order to emulate it without AVX2 you can simply split each register into two halves, permute each half, then combine together:

//AVX only
__m256i _emu_mm256_shuffle_epi8(__m256i reg, __m256i shuf) {
    __m128i reg0 = _mm256_castsi256_si128(reg);
    __m128i reg1 = _mm256_extractf128_si256(reg, 1);
    __m128i shuf0 = _mm256_castsi256_si128(shuf);
    __m128i shuf1 = _mm256_extractf128_si256(shuf, 1);
    __m128i res0 = _mm_shuffle_epi8(reg0, shuf0);
    __m128i res1 = _mm_shuffle_epi8(reg1, shuf1);
    __m256i res = _mm256_setr_m128i(res0, res1);
    return res;
}

If you really want to fully shuffle the 32-byte register, you can follow approach from this paper. Shuffle each half with each half, then blend results together. Without AVX2 it would be something like that:

//AVX only
__m256i _emu_mm256_shuffle32_epi8(__m256i reg, __m256i shuf) {
    __m128i reg0 = _mm256_castsi256_si128(reg);
    __m128i reg1 = _mm256_extractf128_si256(reg, 1);
    __m128i shuf0 = _mm256_castsi256_si128(shuf);
    __m128i shuf1 = _mm256_extractf128_si256(shuf, 1);
    __m128i res00 = _mm_shuffle_epi8(reg0, shuf0);
    __m128i res01 = _mm_shuffle_epi8(reg0, shuf1);
    __m128i res10 = _mm_shuffle_epi8(reg1, shuf0);
    __m128i res11 = _mm_shuffle_epi8(reg1, shuf1);
    __m128i res0 = _mm_blendv_epi8(res10, res00, _mm_cmplt_epi8(shuf0, _mm_set1_epi8(16)));
    __m128i res1 = _mm_blendv_epi8(res11, res01, _mm_cmplt_epi8(shuf1, _mm_set1_epi8(16)));
    __m256i res = _mm256_setr_m128i(res0, res1);
    return res;
}

If you know for sure that only the lower half of reg is used, then you can remove lines for reg1, res10, res11, and remove comparison and blending. Indeed, it might be more efficient to stick with SSE and use 128-bit registers if you have no AVX2.

The general 32-byte shuffling can be significantly optimized with AVX2:

//Uses AVX2
__m256i _ext_mm256_shuffle32_epi8(__m256i reg, __m256i shuf) {
    __m256i regAll0 = _mm256_permute2x128_si256(reg, reg, 0x00);
    __m256i regAll1 = _mm256_permute2x128_si256(reg, reg, 0x11);
    __m256i resR0 = _mm256_shuffle_epi8(regAll0, shuf);
    __m256i resR1 = _mm256_shuffle_epi8(regAll1, shuf);
    __m256i res = _mm256_blendv_epi8(resR1, resR0, _mm256_cmpgt_epi8(_mm256_set1_epi8(16), shuf));
    return res;
}

Beware: code not tested!