I only just noticed the edit, which has a special-case answer.
If you need to do many different bit positions on the same data, then your current plan is good.
If you only need one bit position (esp. the highest bit position) from 128B of memory, you could use _mm256_movemask_ps to get the high bit from each 32b element, then combine the four 8-bit masks in GP registers.
A good compiler should optimize that to:
vmovdqu ymm0, [buf + 0]
; to select a different bit:
; vpslld ymm0, ymm0, count  ; count can be an imm8 or the low qword of an xmm register
vmovmskps eax, ymm0
vmovdqu ymm0, [buf + 32]
vmovmskps ebx, ymm0
... ecx and edx
mov ah, bl      ; eax = m2:m1 in the low 16 bits
mov ch, dl      ; ecx = m4:m3
shl ecx, 16     ; move m4:m3 to the high half
or  eax, ecx    ; eax = m4:m3:m2:m1, one bit per dword of the 128B
This is nice only if you're testing the high bit (so you don't need to shift each vector before the vmovmskps). Even so, this is probably more instructions (and code size) than the other solution.
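For reference, a rough intrinsics version of that special case (just a sketch; the function name is mine, and it assumes you want the high bit of each dword of a 128-byte buffer):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: high bit of each of the 32 dwords in a 128-byte buffer,
// packed into one 32-bit mask with vmovmskps.  AVX1 only.
static inline uint32_t high_bits_128B(const void *buf)
{
    const __m256i *p = (const __m256i *)buf;
    __m256 v0 = _mm256_castsi256_ps(_mm256_loadu_si256(p + 0));
    __m256 v1 = _mm256_castsi256_ps(_mm256_loadu_si256(p + 1));
    __m256 v2 = _mm256_castsi256_ps(_mm256_loadu_si256(p + 2));
    __m256 v3 = _mm256_castsi256_ps(_mm256_loadu_si256(p + 3));
    // For a different bit, shift it up to bit 31 of each dword first
    // (a 256b vpslld needs AVX2; with plain AVX you'd shift 128b halves).
    uint32_t m1 = (uint32_t)_mm256_movemask_ps(v0);
    uint32_t m2 = (uint32_t)_mm256_movemask_ps(v1);
    uint32_t m3 = (uint32_t)_mm256_movemask_ps(v2);
    uint32_t m4 = (uint32_t)_mm256_movemask_ps(v3);
    return m1 | (m2 << 8) | (m3 << 16) | (m4 << 24);  // same merge as the asm above
}
```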
Answer to the original question:
Similar to Elalfer's idea, but use the shuffle unit for pack instructions instead of pshufb. Also, all the ANDs are independent, so they can execute in parallel. Intel CPUs can do 3 ANDs at once, but only one shuffle. (Or two shuffles at once on pre-Haswell.)
// without AVX2: you won't really be able to
// do anything with a __m256i, only __m128i
// just convert everything to regular _mm_..., and leave out the final permute
mask = _mm256_set1_epi32(0x000000ff);
// same mask for all, and the load can fold into the AND
// You can write the load separately if you like, it'll still fold
L1 = and(mask, (buf)) // load and zero the bytes we don't want
L2 = and(mask, (buf+32))
L3 = and(mask, (buf+64))
L4 = and(mask, (buf+96))
// squish dwords from 2 concatenated regs down to words in 1 reg
pack12 = _mm256_packus_epi32(L1, L2);
pack34 = _mm256_packus_epi32(L3, L4);
packed = _mm256_packus_epi16(pack12, pack34); // note the different width: zero-padded-16 -> 8
Vec = permute(packed) // fix DWORD order in the vector (only needed for 256b version)
Vec = shift(Vec, bit_wanted)
bitvec = movemask(Vec)
// shift:
// I guess word or dword granularity is fine, since byte granularity isn't available.
// You only care about the high bit, so it doesn't matter that you're not shifting zeroes into the bottom of each byte.
// _mm_slli_epi32(Vec, imm8): 1 uop, 1c latency if your count is a compile-time constant.
// _mm_sll_epi32(Vec, _mm_cvtsi32_si128(count)): 2 uops, 2c latency if the count is a runtime variable.
// *not* _mm_sllv_epi32(): slower, and only useful if you need a different shift count for each element (we don't).
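Spelling the pseudocode out as AVX2 intrinsics (a sketch under my assumptions: the bit you want is in the low byte of each dword, as the 0x000000ff mask implies; the function name and signature are mine, untested):

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical AVX2 helper: bit `bit` (0..7) from the low byte of each of the
// 32 dwords in a 128-byte buffer, returned as a 32-bit mask in source order.
static inline uint32_t bit_from_dwords_256(const void *buf, int bit)
{
    const __m256i *p = (const __m256i *)buf;
    const __m256i mask = _mm256_set1_epi32(0x000000ff);

    __m256i L1 = _mm256_and_si256(mask, _mm256_loadu_si256(p + 0));
    __m256i L2 = _mm256_and_si256(mask, _mm256_loadu_si256(p + 1));
    __m256i L3 = _mm256_and_si256(mask, _mm256_loadu_si256(p + 2));
    __m256i L4 = _mm256_and_si256(mask, _mm256_loadu_si256(p + 3));

    __m256i pack12 = _mm256_packus_epi32(L1, L2);          // dwords -> words (in-lane)
    __m256i pack34 = _mm256_packus_epi32(L3, L4);
    __m256i packed = _mm256_packus_epi16(pack12, pack34);  // words -> bytes (in-lane)

    // The in-lane packs interleave the four sources; vpermd puts the dwords
    // back into source order so movemask bit i corresponds to source dword i.
    __m256i vec = _mm256_permutevar8x32_epi32(
                      packed, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));

    // Move the wanted bit to the top of each byte.  Use _mm256_slli_epi32 if
    // `bit` is a compile-time constant; the _sll_ form handles a variable count.
    vec = _mm256_sll_epi32(vec, _mm_cvtsi32_si128(7 - bit));
    return (uint32_t)_mm256_movemask_epi8(vec);
}
```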
If you're doing this with just AVX (like you said) then you won't have 256b integer instructions available. Just build 128b vectors, and get 16b at a time of mask data. You won't need a final permute at the end.
Merge masks with integer instructions: (m2<<16) | m1. If desired, even go up to 64b of mask data by combining two 32b masks.
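A corresponding AVX-only sketch (my names again; _mm_packus_epi32 is SSE4.1, which any AVX CPU has), doing 64 bytes per call and then merging two 16-bit masks with integer ops:

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical 128b helper: bit `bit` from the low byte of each of the
// 16 dwords in a 64-byte chunk -> 16 mask bits.  No permute needed.
static inline uint32_t bit_from_dwords_128(const void *buf, int bit)
{
    const __m128i *p = (const __m128i *)buf;
    const __m128i mask = _mm_set1_epi32(0x000000ff);

    __m128i L1 = _mm_and_si128(mask, _mm_loadu_si128(p + 0));
    __m128i L2 = _mm_and_si128(mask, _mm_loadu_si128(p + 1));
    __m128i L3 = _mm_and_si128(mask, _mm_loadu_si128(p + 2));
    __m128i L4 = _mm_and_si128(mask, _mm_loadu_si128(p + 3));

    __m128i pack12 = _mm_packus_epi32(L1, L2);          // SSE4.1
    __m128i pack34 = _mm_packus_epi32(L3, L4);
    __m128i packed = _mm_packus_epi16(pack12, pack34);  // bytes already in source order

    packed = _mm_sll_epi32(packed, _mm_cvtsi32_si128(7 - bit));
    return (uint32_t)_mm_movemask_epi8(packed);
}

// Merge as described above: (m2<<16) | m1 for 128 bytes of input.
static inline uint32_t bits_from_128B(const char *buf, int bit)
{
    uint32_t m1 = bit_from_dwords_128(buf, bit);
    uint32_t m2 = bit_from_dwords_128(buf + 64, bit);
    return (m2 << 16) | m1;
}
```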
Performance: This avoids the need for separate load instructions with AVX, since vpand can micro-fuse a memory operand if used with a one-register addressing mode.
- cycle 1: 3 vpand instructions (or only 2, if we were waiting on the address, since there are only 2 load ports).
- cycle 2: last one or two vpand, one pack (L1, L2)
- cycle 3: next pack (L3, L4)
- cycle 4: final pack
- (256b AVX2: the permute goes here)
- cycle 5: packed shift with imm8 count: 1 uop, 1c latency.
- cycle 6: movemask (3 cycle latency)
Latency = 8 (SnB and later)
Throughput: 3 shuffles (p5), 4 logicals (p015), 1 shift (p0), 1 pmovmsk (p0). 4 load uops.
- SnB/IvB: 9 ALU uops -> 3c. 4 memory reads: 2c.
So depending on what you're doing with the masks, about 3 accumulators (i.e. 3 independent dependency chains in flight) would be needed to keep the execution ports saturated: ceil(latency 8 / throughput 3) = 3.
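For illustration only (assuming the masks just get stored to an output array, and using the hypothetical bit_from_dwords_256 sketched above), the unrolled loop might look something like:

```c
#include <stddef.h>
#include <stdint.h>

// Process nblocks 128-byte blocks.  The 3 calls per iteration are independent
// dependency chains, so out-of-order execution can overlap their ~8c latency
// at the ~3c-per-block throughput limit.
// (Uses bit_from_dwords_256() from the AVX2 sketch above.)
void bits_from_blocks(uint32_t *out, const char *buf, size_t nblocks, int bit)
{
    size_t i = 0;
    for (; i + 3 <= nblocks; i += 3) {
        out[i + 0] = bit_from_dwords_256(buf + 128 * (i + 0), bit);
        out[i + 1] = bit_from_dwords_256(buf + 128 * (i + 1), bit);
        out[i + 2] = bit_from_dwords_256(buf + 128 * (i + 2), bit);
    }
    for (; i < nblocks; i++)      // leftover blocks
        out[i] = bit_from_dwords_256(buf + 128 * i, bit);
}
```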
With shift count in a variable that can't be resolved to a compile-time constant by compiler inlining / unrolling: latency = 9. And the shift produces another uop for p1/p5.
With AVX2 for Haswell and later, there are 3 cycles of extra latency for the vpermd.