I'm not sure how much, if any, you can get the compiler to do for you without manually vectorizing with intrinsics or a C++ vector-class wrapper (like Agner Fog's VCL, if your project's license is compatible with the GPL). There are some non-GPLed wrappers, too.
Cache-blocking a matrix multiply is a fine art (and will be important here), and it would be really nice if you could use Eigen's existing templates but with a different class that uses bitwise and
on integers, instead of multiply on floats. I'm not sure if this is possible.
I did some searching, and most of the literature about binary matrices is about producing a boolean result (including SO questions like this). A vector inner-product is done with AND as the multiply, but with XOR or OR as the addition, not popcount. Maybe there's a search-term I'm missing that describes "normal" matrices that just happen to be (0,1) matrices, but where the product won't be.
Since every millisecond matters, you're probably going to have to manually vectorize this.
It's not that vector-integer stuff is slow in general, it's just vector-integer multiply that's slow, especially compared to vector-float
FMA on recent x86 hardware (especially Intel, which has FP FMA throughput of 2x 256b vectors per clock on Haswell and later).
Since you don't need an actual multiply with boolean elements, just an AND (3 vectors per clock throughput), that's not a problem for you. The efficiency-gain from doing many more elements per vector should more than make up for any extra cost per-vector.
Of course, this assumes an integer matmul implementation using all the same cache-blocking and other optimizations as an equivalent FP matmul, and that's where the problem lies if you don't want to (or don't know how to) write it yourself, and can't find a library that will do it for you.
I'm just answering the question of how efficient it could be, with an optimal implementation. The answer to the title question is a very definite yes, it's a massive waste of time to use actual multiplication, especially with 32-bit elements.
Storage format options:
one element (0/1) per byte:
- 4x the density of
float
(cache footprint / memory bandwidth / elements per vector)
- easy to transpose with byte shuffles
- vertical ADD is easy, in case that matters (e.g. for vectorizing over an outer loop, and working on multiple rows or multiple columns at once. Can be good (avoiding horizontal sums at the end) if you have your data interleaved in a way that makes this work without extra shuffling.)
4 elements per byte, packed into the low nibble:
- 4x the density of separate bytes
- very efficient to popcount with AVX2
vpshufb
. With inputs hot in L1D cache, you could load/AND/accumulate-a-popcount with a throughput of 128 AND-result elements per clock cycle (per core), in theory. 4 fused-domain uops per clock saturates SKL/HSW front-end issue bandwidth of 4 per clock, and doesn't bottleneck on the 3 vector ALU ports, because one of the uops is a pure load. (The other load micro-fuses with the vpand
). When bottlenecked on L2 bandwidth (~one 32B load per cycle), runs at 64 elements per clock. See below.
- slower to create from integer or packed-bitmap (but not bad if you put bits into vectors in an interleaved order for efficient pack/unpack to in-order bytes, rather than forcing bits to be in order).
- hard to transpose (maybe worse than fully packed)
packed bits:
- 8x the density of separate bytes, 256 elements per AVX2 vector.
- can be created from vectors with
pmovmskb
for a non-interleaved storage order. (not very useful for creation on the fly, though, since that puts the result in an integer reg, not a vector. An interleaved bit-order is probably best, esp. for unpacking during a transpose).
- fairly efficient to popcount with AVX2: mask / shift+mask / 2x
vpshufb
. (9 fused-domain uops (8 vector-ALU uops) to AND + accumulate popcount for 256 elements (from 2 row/column vectors), vs. 8 uops (6 vector-ALU uops) for the 4-per-byte strategy (from 4 row/column vectors).) ALU port bottlenecks limit this to 96 elements per clock from L1D or L2. So this has about 1.5x the inner-product throughput of the pack4 strategy when it bottlenecks on L2 bandwidth, or 3/4 the throughput for data hot in L1D, in theory, counting just the inner loop. This is just the inner-product part, not accounting for different pack/unpack costs.
- hard to transpose (but maybe not horrible with
pmovmskb
to extract 1 bit from each byte and make them contiguous).
6 elements per bytes, 0xxx0xxx
(probably no advantages for this problem on HSW/SKL, but interesting to consider):
- 6x the density of separate bytes
- fairly easy to create from 0/1 bytes in an interleaved way, by shifting/ORing, same as the 4 bits per byte format.
optimized for efficient popcount with AVX2 vpshufb
. No need to mask before 2xvpshufb
, just 1 right-shift. (vpshufb
zeros the byte if the high bit is set, otherwise it uses the low nibble as an index. This is why it needs the masking.) Right shifting this format by 4 (vpsrld ymm0,4
) will still leave a zero in the high bit of every byte. Load+AND -> accumulate popcount is 7 fused-domain uops per vector (vmovdqa
/vpand ymm,[mem]
/vpsrld ymm,4
/2xvpshufb
/2xvpaddb
), only 6 of which need ALU ports. So HSW/SKL throughput is in theory 1 vector (of 192 elements) per 2 clocks, or 96 elements per clock. This requires an average load throughput of one 256b vector per clock, so it's right up against the L2 bandwidth bottleneck.
In theory it's the same as fully packed, but in practice it might be slightly faster or slower depending on which one schedules better (fewer AND/ADD uops stealing port 5 from shuffles, for example). Fully packed is probably more likely to come close to theoretical speed, because more of its uops can run on multiple ports. Out-of-order scheduling imperfections are less likely.
- The
pmovmskb
transpose trick doesn't work cleanly.
- Could be useful if we just needed
popcount(A[])
instead of popcount(A[] & B[])
. Or for a different microarchitecture where ALU vs. load throughput was different.
Another variation on this, 7 elements per byte could be popcounted with a single AVX512VBMI (Cannonlake?) vpermi2b
(_mm512_permutex2var_epi8
), where each index byte selects one of 128 bytes from the concatenation of two other registers. A shuffle that wide will probably be slow, but it will hopefully have better throughput than an AVX512 vpshufb
separate-nibble thing.
To count packed-8 with AVX512VBMI (but without AVX512VPOPCNTDQ), you could maybe use vpermi2b
to count the low 7, then shift+mask the top bit and just add it. (popcount of a single bit = that bit).
uint8_t
elements are easier to shuffle efficiently (since there are byte shuffles like vpshufb
), so may be worth considering if you have to transpose on the fly. Or only pack down to bits on the fly while transposing?
32-bit integers are also an option, but not a good option. Fewer elements per vector means fewer shuffle instructions in a transpose, but not by a factor of 4. The number of shuffles in a transpose may scale with something like log2(elements per vector).
This is also a big deal for cache footprint / memory bandwidth. The factor of 8 size difference can mean that doing a whole row or column only takes part of L1, instead of overflowing L1. So it can make cache-blocking easier / less important.
10k * 20k / 8 = 23.84MiB per matrix, using packed-bit elements. That's much larger than L2 cache (256kiB on Haswell, 1MiB on Skylake-AVX512), but will fit in L3 on many-core Xeon CPUs. But L3 is competitively shared by all cores (including other VMs in a cloud environment), and is much slower than L2. (Many-core Xeons like you will be running on in HPC / cloud systems have lower per-core memory bandwidth than quad-core desktops, because of higher latency to L3 cache with no increase in concurrency (see the "latency-bound platforms" section of this answer. It takes more cores to drive the same amount of memory bandwidth on a Xeon, even though the total throughput is higher. But if you can have each core mostly working out of its private L2, you gain a LOT.)
Adding up the AND results: You've arranged your loops so you need to reduce a single run of booleans to a count of the non-zeros. This is a good thing.
With 8-bit integer 0/1 elements, you can do up to 255 vpaddb
before an element could overflow. It has good throughput: 2 per clock on Haswell, 3 per clock on Skylake. With multiple accumulators, that covers a lot of vectors of AND results. Use vpsadbw
against an all-zero vector to horizontally add the bytes in a vector into 64-bit integers. Then combine your accumulators with vpaddq
, then horizontally sum it.
With packed bits, you just want to popcount the vectors of AND results. With AVX2 and your data already in vectors, you definitely want to use a
VPSHUFB
-based bit-slicing popcount. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)
You could consider packing your data 4 bits per byte, in the low nibble. That would mean one vpshufb
could count the bits in each byte of an AND result, without needing any shifting / masking. Inside the inner loop, you'd have 2 loads, vpand
, vpshufb
, vpaddb
. With proper unrolling, that should keep up with L1D load bandwidth of 2x 32B per clock, and saturate all three vector execution ports (on Haswell or Skylake). Break out of that every 128 or 255 vectors or something to accumulate the bytes of your accumulator(s) with vpsadbw
/vpaddq
. (But with cache-blocking, you probably want to break out that often anyway and do different work). So the inner-most loop should run at 4 elements per byte * 32B per vector = 128 elements per clock cycle, if you can arrange for it to read data that's hot in L1D cache. Expect about half that bandwidth from L2 cache on Haswell/Skylake, or much worse from L3 cache.
With uint8_t
elements that are 0 or 1, you can maybe use some integer multiply-add instructions. They're a bit weirdly designed, intended for different use-cases than FP FMA. They add horizontal pairs of multiply results, producing wider elements. VPMADDUBSW
widens from 8 to 16 bit elements, and would work well on 0s and 1s. Since each element can only be in the range 0..2, you can still horizontal-sum with vpsadbw
. But if you're going to vpsadbw
, this gains you nothing over vpand
. It would only be useful if you wanted to use vpaddw
to use 16-bit elements in your vector accumulator, instead of breaking out of a loop to avoid byte overflow. vpmaddubsw doesn't seem useful here, because
vpsadbw` is a better way to horizontally add bytes.
Converting 0/1 integers to bitmaps can be done efficiently with SSE/AVX: For 32-bit integer elements, vpslld ymm0, 31
to left-shift the relevant bit to the top of each element, then vmovmskps eax, ymm0
to get an 8-bit mask of the high byte of each 32-bit element. For 8-bit integer elements, vpslld ymm0, 7
/ vpmovmskb eax, ymm0
to do the same thing but for each byte, producing a 32-bit integer bitmap result. (Only the sign bit of each byte matters, so it's fine that there are no shift instructions with only 8 bit granularity. You don't need to do anything about the bits that carry into the next element.)
This isn't a very good method for using right away with vectors, because you end up with the results in integer registers. This isn't a great format to generate and use on the fly, but it is the most compact so it can make sense if you can keep matrices in this format long-term. (And if you'll be limited by memory bandwidth when loading them.)
Converting 32-bit integers to 8-bit: One ways is with 2x vpackssdw
+ vpacksswb
. Because those operate within the 128b lanes, your elements will end up reordered. But that's ok as long as it's the same ordering for every row/column. It's only a problem if you want to take a chunk of a row/column that doesn't start at a multiple of 32 elements. Another option here is to left-shift (by 8, by 16, and by 24), and OR vectors together. Actually, you can do the shifting for free by using an unaligned load offset by 1, 2, or 3 bytes.
static inline
__m256i load_interleave4x32(const int32_t *input) {
const char *p = (const char*)input;
__m256i t0 = _mm256_load_si256((const __m256i*)(p));
__m256i t1 = _mm256_load_si256((const __m256i*)(p+32*1-1)); // the 1/0 bits will be in the 2nd byte of each 32-bit element
__m256i t2 = _mm256_load_si256((const __m256i*)(p+32*2-2));
__m256i t3 = _mm256_load_si256((const __m256i*)(p+32*3-3));
return t0 | t1 | t2 | t3;
// or write this out with _mm256_or_si256, if you don't have overloaded operators like GNU C does.
// this should compile to 1 load and 3 vpor ymm0, [rdi+31] ... instructions.
}
Converting to half-packed 4 bits per byte: we can use the same idea as above. Get 4 vectors from load_interleave4x32
(or from an array of uint8_t
if you started with 8-bit elements). Left-shift them by 0, 1, 2, and 3 bits, and OR those all together. This interleaved bit-order is fine when we just need to AND a row/column and popcount the whole result, because order doesn't matter. This bit-order is fairly efficient to unpack back to in-order bytes, e.g. AND with set1_epi8(1)
will get you a vector of bytes.
You might use this as part of a transpose if you store your whole matrices in this format, or you could use this format to store temporary copies for a cache-blocked transpose. A matmul touches each row/column multiple times, so it may be worth doing extra work to make a compact format the first time when that lets you do 4x as much work per vector on subsequent passes.
With AVX512BW (Skylake-AVX512)
We really want to be doing the AND and popcnt with vectors, not with scalar integer, because the vectors are twice as wide as AVX2, so they pull further ahead of scalar popcnt
. (Even though Skylake-AVX512 shuts down the vector ALUs (but not scalar) on port 1 while running 512b instructions).
@Harold points out an interesting identity that lets us do 2/3rds as many vector popcounts, at the cost of extra integer operations.
popcnt(a) + popcnt(b) + popcnt(c)
= popcnt(a ^ b ^ c) + 2 * popcnt((a ^ b) & c | (a & b))
a ^ b ^ c
and (a ^ b) & c | (a & b)
can be done with one vpternlogd
each (since each have 3 boolean inputs). The 2*
is free if we use a separate pre-shifted vpshufb
LUT vector. See also this implementation that uses 30x vpternlogd
+ 1 vector popcnt to handle 16 vectors of 512b, with some cleanup at the end (only doing the 16*popcnt
counts inside the loop; everything else is chained).
This is very likely worth it for counting fully-packed 8 bits per byte elements, and makes that format a lot more attractive for AVX512, compared to less-dense formats optimized for popcounting without as much shifting/masking.
vpternlogd
can also be useful as a bit-blend instruction for transposes, if byte-granularity VPBLENDMB zmm{k1}, zmm, zmm
isn't fine-enough grained.
This might be worth it for AVX2 on some CPUs, maybe avoiding 1 out of every 4 or 5 vector popcounts rather than 1 of 3? Or it might not help at all if it just increases the total execution port pressure, and there wasn't a bottleneck on any specific one. It would be useful with scalar popcnt
instructions (maybe on CPUs without AVX2), because those do bottleneck on a single port on Intel CPUs.
We can turn uint8_t
boolean elements into non-interleaved bitmaps slightly more efficiently than AVX2 (without even needing a shift), and do the reverse much more efficiently. Test-into-mask or compare-into-mask against a vector of set1_epi8(1) would both do the job, producing 64 bits of mask from 64 bytes of input. Or with 32-bit integers to start with, producing 16 bits of mask at a time. You can efficiently concatenate those bits with kunpck
instructions.
_mm512_test_epi8_mask
(vptestmb
) is interesting: AND two vectors together, and produce a mask-register result of byte-elements that were true/false. But this isn't really what we want: if we're going to pack our bits, we want to do it as a pre-processing step on the input matrices, not on the fly while doing inner products.
bitmap -> vector of 0 / -1 is fast: __m512i _mm512_movm_epi8 (__mmask64 k)
(vpmovm2b
) does that in one instruction. You can subtract -1
instead of adding 1
, but you'd have to mask it before you can OR together multiple bits within a byte.
Without AVX512BW or AVX512DQ (Knight's Landing Xeon Phi), you don't have 512b vpshufb
so you can't vector popcnt as efficiently. There's an AVX512 popcnt extension for vector popcnt directly, but no hardware with it has even been announced yet. (AVX2 vpshufb ymm
is very slow on KNL, though: one per 12 cycles, and psadbw ymm
is 1 per 9 cycles, so even using 256b vectors is unattractive). You might use a bithack popcnt based on 32-bit integer elements, since that's just AND/shift/ADD. 32-bit elements will take fewer steps to popcnt than 64-bit, but are still big enough not to overflow for reasonable problem sizes (so you can defer a horizontal sum of the vector until outside a loop)
Given the choice of storage format, packing multiple bits per byte might not be a good idea for KNL, but single-byte integer elements are good. vpandd zmm
and vpaddd zmm
are both fast and part of AVX512F, and we can use them because we don't want to let our single-bytes overflow anyway. (Using a packed 32-bit add when we actually have 8-bit elements that won't carry into each other is a SWAR technique.) KNL has good memory bandwidth and poor instruction throughput relative to Skylake-AVX512, I think.
Transposing bits:
BMI2 _pdep_u64
might be useful here. It's a scalar instruction/intrinsic. If it makes bit-transpose a lot more efficient than unpacking to bytes, you'd probably want to store a block of transpose results before reloading it with vector loads for AND + count. (Reloading a vector right away after scalar stores will cause a store-forwarding stall.)
Another useful option is that vpmovmskb
can slice 32 bits out of a 32-byte vector, one per byte. This gives you a building block for a transpose, maybe combined with byte shuffles to get the bytes in the right order for it. For more, see this blog post, and also How would you transpose a binary matrix?.
Using this in a matmul
Some of your choices depend on what format your input data is in, and how often you will reuse the same matrices. If a matrix will be used multiple times, packing it down to 4 or 8 bits per byte ahead of time makes sense. (Or on the fly the first time it's used). Keeping a transposed copy of it may also make sense, especially if it will always be the side of the multiply that needs transposing. (If you sometimes need one way and sometimes the other, redoing on the fly might be better for L3 cache footprint. But these are big enough that you probably won't get a lot of L3 hits, so just keeping a transposed copy could be good.)
Or maybe even write out a transposed and non-transposed version while converting from your input format.
You will definitely want to cache-block the multiplies, so the same data is reused multiple times while hot in L1. I don't have anything useful to say about this off the top of my head. The same principles apply as when cache-blocking a normal FP matmul, so go read about that.
Comments on your C++ implementation:
Using a bitset &
for a whole column will put the values back in memory, and then you'll loop over them again in .count()
on the result. I doubt that the compiler will optimize this into a one-pass loop that uses a VPSHUFB
-based bit-slicing popcnt on each vector of VPAND
results, but that would be much better. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)
With your matrix sizes, at least that inner loop probably hits in L1D cache, but the extra load/store instructions from looping twice are more overhead and it also interferes with prefetch of the valuable data.
Getting compilers to efficiently popcnt a dynamically-sized bitmap (without manually vectorizing) is not easy. The only thing that doesn't suck is clang++ -stdlib=libc++
with vector<bool>
, which compiles std::count(v.begin(), v.end(), true);
to a vectorized vpshufb
+ vpsadbw
+ vpaddq
loop, which is quite good. It would be faster if it just used vpaddb
inside the unrolled loop and vpsadbw + vpaddq
once per iteration, but it's pretty good for auto-vectorized code.
g++'s vector<bool>
is also a bitmap, but std::count(v.begin(), v.end(), true);
is very bad: it uses a totally naive loop that tests 1 bit at a time. And it doesn't even do that efficiently. Same for clang++
with the default libstdc++
instead of the new libc++
.
boost::dynamic_bitset
has a .count()
member function, but it doesn't take advantage of the popcnt
instruction or AVX2. It does a byte-at-a-time LUT lookup. That's much better than std::count(vector<bool>)
without libc++, but it's not even close to good enough for HPC.
Here's the test code on the Godbolt compiler explorer, with gcc and clang asm output. All of them used -march=haswell
.
But unfortunately, there doesn't seem to be an efficient way to bitwise-AND two std::vector<bool>
. This answer shows how to get at the underlying implementation of g++'s libstdc++
vector<bool>
, but that code doesn't auto-vectorize. Doing the same thing for libc++
and tweaking it so it auto-vectorizes might let you get a good fraction of the performance possible with manual vectorization (except for transpose), but you'd probably have to keep your whole matrix in one vector<bool>
, because a vector of vectors is a bad extra level of indirection. If the transpose part of the problem is performance-critical too, using standard containers to get access to an efficient popcount is not going to solve the whole problem.
For std::bitset<1024*1024>.count()
, clang makes the same efficient AVX2 popcount with or without libc++
. g++ makes a scalar loop using the 64-bit popcnt
instruction, which (according to this) is somewhat faster than a good AVX2 popcnt for small bitsets, but somewhat slower for large bitsets, on Haswell and Skylake.
See also: On vector<bool>
— Howard Hinnant, for some commentary on the C++ standard library, and why an array of bits is a useful data structure, but vector<bool>
is a bad name for it. Also, some benchmarks for properly-optimized count/find_first/etc. on a bit-vector vs. a 1 bool
-per-byte bool[]
array, vs. a naive vector<bool>
(like you get from gcc and clang without libc++).
SSE and
(and maybe with using POPCNT) - if the compiler don't use these already... – gezavpshufb
method (LUT of 4-bit nibbles) than with 64-bitpopcnt
. – Peter Cordesstd::bitset.count()
with-march=native
. @geze:-march=native
enables-mpopcnt
on CPUs that support it. And gcc'sstd::bitset<64>
does usepopcnt
. – Peter Cordes-mpopcnt
is enabled indeed – NULL