#include <cstdint>
#include <cstring>

inline uint8_t pack8bools(const bool* a)
{
    uint64_t t;
    memcpy(&t, a, sizeof t); // strict-aliasing & alignment safe load
    return 0x8040201008040201ULL*t >> 56;
    // bit order: a[0]<<7 | a[1]<<6 | ... | a[7]<<0 on little-endian
    // for a[0] => LSB, use 0x0102040810204080ULL on little-endian
}
void unpack8bools(uint8_t b, bool* a)
{
    // on little-endian this gives a[0] = (b>>7) & 1, i.e. printing order, matching pack8bools
    auto MAGIC = 0x8040201008040201ULL; // one set bit every 9 bits, so the shifted copies of b never overlap
    auto MASK  = 0x8080808080808080ULL;
    uint64_t t = ((MAGIC*b) & MASK) >> 7;
    memcpy(a, &t, sizeof t); // store 8 bytes without UB
}
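The two functions invert each other on either endianness (the bit order inside the byte differs between little- and big-endian hosts, but pack and unpack always agree). A quick round-trip sanity check, assuming the definitions above (the example values are mine):

#include <cassert>

int main()
{
    bool in[8]  = {true, false, false, true, false, true, true, false};
    bool out[8] = {};
    uint8_t packed = pack8bools(in);  // 0x96 on little-endian: in[0] is the MSB
    unpack8bools(packed, out);
    for (int i = 0; i < 8; ++i)
        assert(in[i] == out[i]);      // round trip restores the array
}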
This assumes sizeof(bool) == 1, which the standard doesn't guarantee but which holds on all mainstream implementations (add a static_assert if you want to be sure).

To portably get LSB <-> a[0] order (like the pext/pdep version below) instead of the opposite of host endianness, use htole64(0x0102040810204080ULL) as the magic multiplier in pack8bools. (htole64 is from BSD / GNU <endian.h>.) That arranges the multiplier bytes to match little-endian order for the bool array; htobe64 with the same constant gives the other order, MSB-first like you'd use for printing a number in base 2. For unpack8bools, though, simply byte-reversing the multiplier doesn't work: the set bits of 0x0102040810204080 are only 7 apart, so the shifted copies of the 8-bit input overlap and carries can corrupt the masked bits (try b = 0x81, which wrongly sets a[1]). Keep the carry-free 0x8040201008040201 multiplier there and byte-swap the result instead.
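A sketch of what those portable LSB-first variants could look like (the function names are mine; htole64/htobe64 come from glibc's <endian.h>, or <sys/endian.h> on the BSDs, so this isn't ISO C++):

#include <endian.h>  // htole64 / htobe64 (glibc; <sys/endian.h> on BSDs)
#include <cstdint>
#include <cstring>

inline uint8_t pack8bools_lsb(const bool* a)  // a[0] -> bit 0 on any endianness
{
    uint64_t t;
    memcpy(&t, a, sizeof t);
    return htole64(0x0102040810204080ULL) * t >> 56;
}

inline void unpack8bools_lsb(uint8_t b, bool* a)  // bit 0 -> a[0] on any endianness
{
    uint64_t t = ((0x8040201008040201ULL * b) & 0x8080808080808080ULL) >> 7;
    t = htobe64(t);  // byte-swap on little-endian, no-op on big-endian
    memcpy(a, &t, sizeof t);
}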
You may want to make sure that the bool array is 8-byte aligned (alignas(8)) for performance, and that the compiler knows this. memcpy is always safe for any alignment, but on ISAs that require alignment, a compiler can only inline memcpy as a single load or store instruction if it knows the pointer is sufficiently aligned. *(uint64_t*)a would promise alignment, but would also violate the strict-aliasing rule. Even on ISAs that allow unaligned loads, those loads can be faster when naturally aligned, and there the compiler can inline memcpy as a single instruction without seeing an alignment guarantee at compile time.
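For example (a minimal sketch; the names are mine, and it assumes the pack8bools above is in scope):

#include <cstdint>

uint8_t packed_flags()
{
    alignas(8) bool flags[8] = {true, true, false, false,
                                true, false, true, false};
    return pack8bools(flags);  // the compiler can prove 8-byte alignment,
                               // so the memcpy inlines to one 64-bit load
}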
How they work
Suppose we have 8 bools b[0] to b[7] whose least significant bits are named a–h respectively, and we want to pack them into a single byte. Loading those 8 consecutive bools as one 64-bit word gives us the bits in reversed order on a little-endian machine. Now we'll do a multiplication (here dots are zero bits):
  |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
  ↑..d.....↑.c......↑b.......a
  ↑.c......↑b.......a
  ↑b.......a
  a
  ────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the positions of the set bits in the magic number. At this point the 8 bits we want have been collected in the top byte, so we just need to shift them down by 56 to discard the rest (that's the >> 56 in pack8bools).
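A concrete trace of that diagram (the example values are mine):

#include <cstdint>
#include <cstdio>

int main()
{
    // t as loaded on little-endian from a[8] = {1,0,0,0,1,0,0,1}:
    // byte k of t holds a[k]
    uint64_t t = 0x0100000100000001ULL;
    uint8_t  r = 0x8040201008040201ULL * t >> 56;
    printf("%#x\n", (unsigned)r);  // 0x89 = a[0]<<7 | a[4]<<3 | a[7]<<0
}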
So the magic number for packing is 0b1000000001000000001000000001000000001000000001000000001000000001, i.e. 0x8040201008040201. If you're on a big-endian machine you'll need to use the magic number 0x0102040810204080, which is calculated in a similar manner.
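If you'd rather not hard-code the constant, it can be built from the structure described above; a small C++14 sketch (the function name is mine):

#include <cstdint>

// Pack multiplier: one set bit every 9 bits.
constexpr uint64_t pack_magic()
{
    uint64_t m = 0;
    for (int i = 0; i < 8; ++i)
        m |= 1ULL << (9 * i);  // bits 0, 9, 18, ..., 63
    return m;
}
static_assert(pack_magic() == 0x8040201008040201ULL,
              "magic has one set bit every 9 bits");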
For unpacking we can do a similar multiplication:
  |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
                                                          abcdefgh
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
= h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
& 1000000010000000100000001000000010000000100000001000000010000000
  ────────────────────────────────────────────────────────────────
= h0000000g0000000f0000000e0000000d0000000c0000000b0000000a0000000
After multiplying, the needed bits sit at the most significant position of each byte, so we mask out the irrelevant bits and shift the survivors down to the least significant position of each byte. Stored to memory on a little-endian machine, the output bytes then contain a through h in order, i.e. a[0] = a, the most significant bit of b.
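The same trace in code, continuing the 0x89 example from above:

#include <cstdint>
#include <cstdio>

int main()
{
    uint8_t  b = 0x89;                       // 0b10001001
    uint64_t p = 0x8040201008040201ULL * b;  // smear copies of b across the word
    uint64_t t = (p & 0x8080808080808080ULL) >> 7;
    printf("%#018llx\n", (unsigned long long)t);  // 0x0100000100000001:
                                                  // bytes 0, 4 and 7 are 1
}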
The efficient way
On x86 CPUs with BMI2 there are PEXT and PDEP instructions for this purpose (intrinsics in <immintrin.h>). The body of the pack8bools function above can be replaced with

uint64_t t; memcpy(&t, a, sizeof t);         // same safe load instead of the UB cast
return _pext_u64(t, 0x0101010101010101ULL);
And the unpack8bools body can be implemented as

uint64_t t = _pdep_u64(b, 0x0101010101010101ULL);
memcpy(a, &t, sizeof t);
(This maps LSB -> LSB, i.e. the same order as the 0x0102040810204080ULL multiplier constant, the opposite of 0x8040201008040201ULL. x86 is little-endian, so a[0] = (b>>0) & 1 after the memcpy.)
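Put together as complete functions, a sketch (the _bmi2 names are mine; compile with -mbmi2, or use __attribute__((target("bmi2"))) on GCC/clang):

#include <cstdint>
#include <cstring>
#include <immintrin.h>

inline uint8_t pack8bools_bmi2(const bool* a)  // a[0] -> bit 0
{
    uint64_t t;
    memcpy(&t, a, sizeof t);                     // safe load, as before
    return _pext_u64(t, 0x0101010101010101ULL);  // gather each byte's low bit
}

inline void unpack8bools_bmi2(uint8_t b, bool* a)  // bit 0 -> a[0]
{
    uint64_t t = _pdep_u64(b, 0x0101010101010101ULL);  // scatter bits to byte LSBs
    memcpy(a, &t, sizeof t);
}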
Unfortunately pdep and pext are very slow on AMD CPUs before Zen 3 (they're implemented in microcode there), so you may need to benchmark against the multiplication method above to see which is better for your targets.
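If one binary has to run well everywhere, a runtime-dispatch sketch (GCC/clang only; __builtin_cpu_supports is a real builtin, the dispatch function name is mine, and it assumes the pack8bools_lsb and pack8bools_bmi2 sketches above so that both paths produce the same LSB-first bit order):

#include <cstdint>

uint8_t pack8bools_dispatch(const bool* a)
{
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("bmi2"))
        return pack8bools_bmi2(a);  // fast on Intel and on AMD Zen 3 and later
#endif
    return pack8bools_lsb(a);       // portable multiply version, same bit order
}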