The first example at the end of this answer shows how to use the BMI2 instruction pdep to compute the 8-byte array.
Note that on Intel Haswell processors and newer, the pdep instruction has a throughput of one instruction per cycle and a latency of 3 cycles, which is fast. Unfortunately, on AMD Ryzen this instruction is relatively slow: both latency and reciprocal throughput are 18 cycles.
For AMD Ryzen it is better to replace the pdep instruction with a multiplication and a few bitwise operations, which are quite fast on AMD Ryzen; see the second example at the end of this answer.
See also here and here for efficient inverse movemask computations, with a scalar source and a 256-bit AVX2 vector destination.
Instead of working with 8 bits and 8 bytes at a time, it might be more efficient to reorganize your algorithm to work with 4 x 8 bits and 4 x 8 bytes per step.
In that case the full AVX2 vector width of 256 bits can be utilized, which might be faster.
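As a rough illustration of the 4 x 8 idea, here is a minimal sketch of one common way to do an inverse movemask with AVX2 (broadcast, byte shuffle, bit select, compare). The function name expand_32bits_to_32bytes is my own, hypothetical choice and is not taken from the linked answers. The AVX2 path is only compiled when the compiler defines __AVX2__ (for example with -mavx2); otherwise a plain loop computes the same result.

```c
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical helper (name is an assumption, not from the answer):
   expand 32 bits into 32 bytes, out[i] = bit i of bits (0 or 1). */
void expand_32bits_to_32bytes(uint32_t bits, uint8_t out[32])
{
#ifdef __AVX2__
    /* Copy the 32-bit input into every 32-bit element of the vector. */
    __m256i v = _mm256_set1_epi32((int)bits);
    /* Replicate each input byte 8 times: output bytes 0..7 get input
       byte 0, bytes 8..15 get input byte 1, and so on. The shuffle works
       per 128-bit lane, which is fine because both lanes hold the input. */
    const __m256i shuf = _mm256_setr_epi64x(
        0x0000000000000000LL, 0x0101010101010101LL,
        0x0202020202020202LL, 0x0303030303030303LL);
    v = _mm256_shuffle_epi8(v, shuf);
    /* Byte j of each 8-byte group selects bit j: 0x01, 0x02, ..., 0x80. */
    const __m256i bit_select =
        _mm256_set1_epi64x((long long)0x8040201008040201ULL);
    v = _mm256_and_si256(v, bit_select);
    /* 0xFF where the selected bit was set, then reduce 0xFF to 1. */
    v = _mm256_cmpeq_epi8(v, bit_select);
    v = _mm256_and_si256(v, _mm256_set1_epi8(1));
    _mm256_storeu_si256((__m256i *)out, v);
#else
    /* Portable fallback with the same result, one bit at a time. */
    for (int i = 0; i < 32; i++) {
        out[i] = (uint8_t)((bits >> i) & 1u);
    }
#endif
}
```

Calling expand_32bits_to_32bytes(0x4C, out) should give the same first 8 bytes as the pdep example, followed by 24 zero bytes. Whether this is actually faster than four scalar steps depends on the surrounding code, so it is worth benchmarking.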
Peter Cordes shows that the pext instruction can be used for the conversion in the opposite direction: from 8 bytes to 8 bits.
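A minimal sketch of that opposite direction, assuming each input byte holds 0 or 1 and using the same mask as in the pdep example. The function name pack_8bytes_to_8bits is hypothetical. The pext path is only compiled when BMI2 is enabled on x86-64 (for example with -mbmi2); the fallback is a multiplication trick analogous to the AMD Ryzen example, not something taken from Peter Cordes' answer.

```c
#include <stdint.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

/* Hypothetical helper (name is an assumption): pack 8 bytes, each 0 or 1,
   into 8 bits. Bits 8*i .. 8*i+7 of the input (array element i on a
   little-endian machine) become bit i of the result. */
uint8_t pack_8bytes_to_8bits(uint64_t bytes)
{
    const uint64_t mask = 0x0101010101010101ULL;
#if defined(__BMI2__) && defined(__x86_64__)
    /* pext gathers the bits selected by the mask into the low 8 bits. */
    return (uint8_t)_pext_u64(bytes, mask);
#else
    /* The multiplier has bits 7, 14, ..., 56 set, so input bit 8*k is
       copied to bit 56+k; the copies never overlap, hence no carries. */
    return (uint8_t)(((bytes & mask) * 0x0102040810204080ULL) >> 56);
#endif
}
```

With the array from the pdep example (a[2], a[3] and a[6] set), the input is 0x0001000001010000 on a little-endian machine and the function should return 0b01001100 again.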
Code example with the pdep instruction:
    /* gcc -O3 -Wall -m64 -march=skylake bytetoarr.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main(){
        int i;
        union {
            uint8_t a8[8];
            uint64_t a64;
        } t;
        /* With mask = 0b0000000100......0100000001 = 0x0101010101010101 */
        /* the input bits 0, 1, ..., 7 are expanded                      */
        /* to the right positions of the uint64_t = 8 x uint8_t output   */
        uint64_t mask = 0x0101010101010101;
        /* example input: */
        uint8_t x = 0b01001100;
        t.a64 = _pdep_u64(x, mask);
        for (i = 0; i < 8; i++){
            printf("a[%i] = %hhu\n", i, t.a8[i]);
        }
    }
The output is:
$ ./a.out
a[0] = 0
a[1] = 0
a[2] = 1
a[3] = 1
a[4] = 0
a[5] = 0
a[6] = 1
a[7] = 0
Code example for AMD Ryzen processors:
    /* gcc -O3 -Wall -m64 -march=skylake bytetoarr_amd.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main(){
        int i;
        union {
            uint8_t a8[8];
            uint64_t a64;
        } t;
        /* example input: */
        uint8_t x = 0b01001100;
        uint64_t x64 = x;
        /* Unset the lowest bit. */
        uint64_t x_hi = x64 & 0xFE;
        /* Copy the remaining 7 bits 7 times. */
        uint64_t r_hi = x_hi * 0b10000001000000100000010000001000000100000010000000;
        /* Merge the lowest bit into the result. */
        uint64_t r = r_hi | x64;
        /* Mask off the bits at the unwanted positions. */
        t.a64 = r & 0x0101010101010101;
        for (i = 0; i < 8; i++){
            printf("a[%i] = %hhu\n", i, t.a8[i]);
        }
    }
– wim