My initial approach was similar to @Jason R's, because that is how "normal" operations work, but most of these operations only care about the high bit of each element and ignore all the other bits. Once I realized this, the _mm*_maskz_broadcast*_epi*(mask,__m128i) series of functions made the most sense.
You will need to enable -mavx512vl and -mavx512bw (gcc).
To get a vector with the highest bit of each byte set according to a mask:
/* convert 16 bit mask to __m128i control byte mask */
_mm_maskz_broadcastb_epi8((__mmask16)mask,_mm_set1_epi32(~0))
/* convert 32 bit mask to __m256i control byte mask */
_mm256_maskz_broadcastb_epi8((__mmask32)mask,_mm_set1_epi32(~0))
/* convert 64 bit mask to __m512i control byte mask */
_mm512_maskz_broadcastb_epi8((__mmask64)mask,_mm_set1_epi32(~0))
To get a vector with the highest bit of each word set according to a mask:
/* convert 8 bit mask to __m128i control word mask */
_mm_maskz_broadcastw_epi16((__mmask8)mask,_mm_set1_epi32(~0))
/* convert 16 bit mask to __m256i control word mask */
_mm256_maskz_broadcastw_epi16((__mmask16)mask,_mm_set1_epi32(~0))
/* convert 32 bit mask to __m512i control word mask */
_mm512_maskz_broadcastw_epi16((__mmask32)mask,_mm_set1_epi32(~0))
To get a vector with the highest bit of each double word set according to a mask:
/* convert 8 bit mask to __m256i control mask */
_mm256_maskz_broadcastd_epi32((__mmask8)mask,_mm_set1_epi32(~0))
/* convert 16 bit mask to __m512i control mask */
_mm512_maskz_broadcastd_epi32((__mmask16)mask,_mm_set1_epi32(~0))
To get a vector with the highest bit of each quad word set according to a mask:
/* convert 8 bit mask to __m512i control mask */
_mm512_maskz_broadcastq_epi64((__mmask8)mask,_mm_set1_epi32(~0))
The one specific to this question is _mm256_maskz_broadcastb_epi8((__mmask32)mask,_mm_set1_epi32(~0)), but I include the others for reference/comparison.
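As a rough sketch, the call for this question could be wrapped as follows. The helper name mask32_to_bytes and the scalar fallback are my own assumptions for illustration, not part of any library; the fallback only exists so the snippet still builds when AVX512BW/AVX512VL are not enabled.

```c
#include <stdint.h>
#if defined(__AVX512BW__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

/* Hypothetical helper: expand a 32-bit mask into 32 bytes.
   Bit i of the mask controls byte i of the output. */
static void mask32_to_bytes(uint32_t mask, uint8_t out[32])
{
#if defined(__AVX512BW__) && defined(__AVX512VL__)
    /* Broadcast an all-ones byte into every lane whose mask bit is set;
       lanes whose mask bit is clear are zeroed by the maskz form. */
    __m256i v = _mm256_maskz_broadcastb_epi8((__mmask32)mask,
                                             _mm_set1_epi32(~0));
    _mm256_storeu_si256((__m256i *)out, v);
#else
    /* Assumed scalar equivalent for builds without AVX512BW/VL. */
    for (int i = 0; i < 32; ++i)
        out[i] = ((mask >> i) & 1u) ? 0xFF : 0x00;
#endif
}
```

Storing to a plain byte array is only there to make the result easy to inspect; in real code you would normally keep the __m256i in a register and feed it straight to the consuming instruction.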
Note that each byte/word/... will either be all ones or all zeroes according to the mask (not just the highest bit). This can also be useful for doing vectorized bit operations (&'ing with another vector for instance to zero out unwanted bytes/words).
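To illustrate the &'ing use mentioned above, here is a minimal sketch that zeroes out the unwanted bytes of a 32-byte block. The function name keep_selected_bytes and the scalar fallback path are hypothetical and only there for illustration.

```c
#include <stdint.h>
#if defined(__AVX512BW__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

/* Hypothetical helper: copy 32 bytes from src to dst, zeroing every byte
   whose bit in `mask` is clear (bit i controls byte i). */
static void keep_selected_bytes(const uint8_t src[32], uint32_t mask,
                                uint8_t dst[32])
{
#if defined(__AVX512BW__) && defined(__AVX512VL__)
    /* Each selected byte of `select` is 0xFF, every other byte is 0x00. */
    __m256i select = _mm256_maskz_broadcastb_epi8((__mmask32)mask,
                                                  _mm_set1_epi32(~0));
    __m256i data = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, _mm256_and_si256(data, select));
#else
    /* Assumed scalar equivalent for builds without AVX512BW/VL. */
    for (int i = 0; i < 32; ++i)
        dst[i] = ((mask >> i) & 1u) ? src[i] : 0;
#endif
}
```

If the AND is all you need, a zero-masked move such as _mm256_maskz_mov_epi8 applied directly to the data should give the same result without materializing the vector mask; the version above is meant for the case where the vector mask itself is reused.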
Another note: each _mm_set1_epi32(~0) could/should be converted to a constant (either manually or by the compiler), so it should compile to just one fairly quick operation, though it may be slightly faster in testing than in real life, since in a test loop the constant will likely stay in a register. The compiler can then convert these to VPMOVM2{B,W,D,Q} instructions (note that the D and Q forms belong to AVX512DQ, so they also need -mavx512dq).
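If you would rather ask for VPMOVM2* directly than rely on the compiler folding the broadcast, the _mm*_movm_epi* intrinsics map onto those instructions. A sketch for the 16-bit-word case follows; the helper name mask16_to_words and the scalar fallback are assumptions for illustration.

```c
#include <stdint.h>
#if defined(__AVX512BW__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

/* Hypothetical helper: expand a 16-bit mask into 16 words.
   Bit i of the mask controls word i of the output. */
static void mask16_to_words(uint16_t mask, uint16_t out[16])
{
#if defined(__AVX512BW__) && defined(__AVX512VL__)
    /* _mm256_movm_epi16 corresponds to VPMOVM2W: each word becomes
       0xFFFF if its mask bit is set and 0x0000 otherwise. */
    __m256i v = _mm256_movm_epi16((__mmask16)mask);
    _mm256_storeu_si256((__m256i *)out, v);
#else
    /* Assumed scalar equivalent for builds without AVX512BW/VL. */
    for (int i = 0; i < 16; ++i)
        out[i] = ((mask >> i) & 1u) ? 0xFFFF : 0x0000;
#endif
}
```

This form needs no all-ones source operand at all, which sidesteps the question of whether the constant stays in a register.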
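Edit: In case your compiler doesn't support the AVX512 intrinsics, the inline assembly version (gcc syntax) should look like:
static inline __m256i dmask2epi8(__mmask32 mask){
    __m256i ret;
    /* vpmovm2b: each byte becomes 0xFF if its mask bit is set, else 0x00 */
    __asm("vpmovm2b %1, %0":"=x"(ret):"k"(mask));
    return ret;
}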
The other instructions are similar.
_mm256_mask_blend_epi8(__mmask32 k, __m256i a, __m256i b) using your integer as the mask – technosaurus
vpsllvd variable-shift to put different bits of the mask in the sign bit of each element. This is great for an element size of 32b, but not for 8b. – Peter Cordes