9
votes

For multiplying large binary matrices (10Kx20K), what I usually to do is to convert the matrices to float ones and perform float matrix multiplication as integer matrix multiplication is pretty slow (have a look at here).

This time though, I'd need to perform over hundred thousands of these multiplications and even a millisecond performance improvement on average matters to me.


I want an int or float matrix as a result, because the product may have elements that aren't 0 or 1. The input matrix elements are all 0 or 1, so they can be stored as single bits.

In the inner-product between a row vector and a column vector (to produce one element of the output matrix), multiplication simplifies to bitwise AND. Addition is still addition, but we can add bits with a population-count function instead of looping over them individually.

Some other boolean/binary-matrix functions OR the bits instead of counting them, producing a bit-matrix result, but that's not what I need.


Here is a sample code showing that forming the problem as std::bitset, AND and count operations is faster than matrix multiplication.

#include <iostream>
using std::cout; using std::endl;
#include <vector>
    using std::vector;
#include <chrono>
#include <Eigen/Dense>
    using Eigen::Map; using Eigen::Matrix; using Eigen::MatrixXf;
#include <random>
    using std::random_device; using std::mt19937; using std::uniform_int_distribution;
#include <bitset>
    using std::bitset;

using std::floor;

const int NROW = 1000;
const int NCOL = 20000;

const float DENSITY = 0.4;
const float DENOMINATOR = 10.0 - (10*DENSITY);

void fill_random(vector<float>& vec) {
    random_device rd;
    mt19937 eng(rd());
    uniform_int_distribution<> distr(0, 10);
    int nnz = 0;
    for (int i = 0; i < NROW*NCOL; ++i)
        vec.push_back(floor(distr(eng)/DENOMINATOR));
}

void matmul(vector<float>& vec){
    float *p = vec.data();
    MatrixXf A = Eigen::Map<Eigen::Matrix<float, NROW, NCOL, Eigen::RowMajor>>(p);
    cout << "Eigen matrix has " << A.rows() << " rows and " << A.cols() << " columns." << endl;
    cout << "Total non-zero values : " << A.sum() << endl;
    cout << "The density of non-zero values is " <<  A.sum() * 1.0 / (A.cols()*A.rows()) << endl;

    auto start = std::chrono::steady_clock::now();
    MatrixXf B = A.transpose() * A;
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "Mat mul took " << end << " ms"<< endl;

    // Just to make sure the operation is not skipped by compiler
    cout << "Eigen coo ";
    for (int i=0; i<10; ++i)
        cout << B(0,i) << " ";
    cout << endl;
}


void bitset_op(vector<float>& vec) {
    // yeah it's not a great idea to set size at compile time but have to
    vector<bitset<NROW>> col_major(NCOL);

    // right, multiple par for isn't a good idea, maybe in a parallel block
    // Doing this for simplicity to profile second loop timing 
    // converting row major float vec to col major bool vec
    #pragma omp parallel for
    for (int j=0; j < NCOL; ++j) {
        for (int i=0; i < NROW; ++i) {
            col_major[j].set(i, vec[i*NCOL + j] && 1);
        }
    }

    auto start = std::chrono::steady_clock::now();
    vector<int> coo;
    coo.assign(NCOL*NCOL, 0);
    #pragma omp parallel for
    for (int j=0; j < NCOL; ++j) {
        for (int k=0; k<NCOL; ++k) {
            coo[j*NCOL + k] = (col_major[j]&col_major[k]).count();
        }
    }
    auto end = (std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start)).count();
    cout << "bitset intersection took " << end << " ms"<< endl;

    // Just to make sure the operation is not skipped by compiler
    cout << "biset coo ";
    for (int i=0; i<10; ++i)
        cout << coo[i] << " ";
    cout << endl;
}


int main() {
    // Saving to float instead of int to speed up matmul
    vector<float> vec;
    fill_random(vec);
    matmul(vec);
    bitset_op(vec);
}

Running this with:

g++ -O3 -fopenmp -march=native -I. -std=c++11 code.cpp -o code

I get:

Eigen matrix has 1000 rows and 20000 columns.
Total non-zero values : 9.08978e+06
The density of non-zero values is 0.454489
Mat mul took 1849 ms
Eigen coo 458 206 208 201 224 205 204 199 217 210
bitset intersection took 602 ms
biset coo 458 206 208 201 224 205 204 199 217 210

As you can see, matmul as set of bitset operations is about 3x faster than Eigen's float matmul, which makes sense.

I want to emphasize that I need to perform this operation over 100K (in the HPC or cloud) and a millisecond performance improvement on average would make a difference.

I'm not bound to any specific library, C++ standard, etc. So please feel free to answer with any solution that you think is faster other than those using GPU, as I can't use it for a number of reasons.

1
I think that you can make a significantly faster version with using SSE and (and maybe with using POPCNT) - if the compiler don't use these already...geza
Do you have AVX2 available (Intel Haswell or later)? I'm assuming Intel since that's pretty much standard for HPC/cloud stuff these days, but let us know if you're on AMD. On Intel, pop-counting a large array is faster with the AVX2 vpshufb method (LUT of 4-bit nibbles) than with 64-bit popcnt.Peter Cordes
Hopefully your compiler is picking the optimal strategy when you compile std::bitset.count() with -march=native. @geze: -march=native enables -mpopcnt on CPUs that support it. And gcc's std::bitset<64> does use popcnt.Peter Cordes
@PeterCordes yes, I do have AVX2 available. I'm mostly using Google cloud and it's easy to get newer architectures as well.NULL
@geza -mpopcnt is enabled indeedNULL

1 Answers

14
votes

I'm not sure how much, if any, you can get the compiler to do for you without manually vectorizing with intrinsics or a C++ vector-class wrapper (like Agner Fog's VCL, if your project's license is compatible with the GPL). There are some non-GPLed wrappers, too.

Cache-blocking a matrix multiply is a fine art (and will be important here), and it would be really nice if you could use Eigen's existing templates but with a different class that uses bitwise and on integers, instead of multiply on floats. I'm not sure if this is possible.

I did some searching, and most of the literature about binary matrices is about producing a boolean result (including SO questions like this). A vector inner-product is done with AND as the multiply, but with XOR or OR as the addition, not popcount. Maybe there's a search-term I'm missing that describes "normal" matrices that just happen to be (0,1) matrices, but where the product won't be.

Since every millisecond matters, you're probably going to have to manually vectorize this.


It's not that vector-integer stuff is slow in general, it's just vector-integer multiply that's slow, especially compared to vector-float FMA on recent x86 hardware (especially Intel, which has FP FMA throughput of 2x 256b vectors per clock on Haswell and later).

Since you don't need an actual multiply with boolean elements, just an AND (3 vectors per clock throughput), that's not a problem for you. The efficiency-gain from doing many more elements per vector should more than make up for any extra cost per-vector.

Of course, this assumes an integer matmul implementation using all the same cache-blocking and other optimizations as an equivalent FP matmul, and that's where the problem lies if you don't want to (or don't know how to) write it yourself, and can't find a library that will do it for you.

I'm just answering the question of how efficient it could be, with an optimal implementation. The answer to the title question is a very definite yes, it's a massive waste of time to use actual multiplication, especially with 32-bit elements.


Storage format options:

one element (0/1) per byte:

  • 4x the density of float (cache footprint / memory bandwidth / elements per vector)
  • easy to transpose with byte shuffles
  • vertical ADD is easy, in case that matters (e.g. for vectorizing over an outer loop, and working on multiple rows or multiple columns at once. Can be good (avoiding horizontal sums at the end) if you have your data interleaved in a way that makes this work without extra shuffling.)

4 elements per byte, packed into the low nibble:

  • 4x the density of separate bytes
  • very efficient to popcount with AVX2 vpshufb. With inputs hot in L1D cache, you could load/AND/accumulate-a-popcount with a throughput of 128 AND-result elements per clock cycle (per core), in theory. 4 fused-domain uops per clock saturates SKL/HSW front-end issue bandwidth of 4 per clock, and doesn't bottleneck on the 3 vector ALU ports, because one of the uops is a pure load. (The other load micro-fuses with the vpand). When bottlenecked on L2 bandwidth (~one 32B load per cycle), runs at 64 elements per clock. See below.
  • slower to create from integer or packed-bitmap (but not bad if you put bits into vectors in an interleaved order for efficient pack/unpack to in-order bytes, rather than forcing bits to be in order).
  • hard to transpose (maybe worse than fully packed)

packed bits:

  • 8x the density of separate bytes, 256 elements per AVX2 vector.
  • can be created from vectors with pmovmskb for a non-interleaved storage order. (not very useful for creation on the fly, though, since that puts the result in an integer reg, not a vector. An interleaved bit-order is probably best, esp. for unpacking during a transpose).
  • fairly efficient to popcount with AVX2: mask / shift+mask / 2xvpshufb. (9 fused-domain uops (8 vector-ALU uops) to AND + accumulate popcount for 256 elements (from 2 row/column vectors), vs. 8 uops (6 vector-ALU uops) for the 4-per-byte strategy (from 4 row/column vectors).) ALU port bottlenecks limit this to 96 elements per clock from L1D or L2. So this has about 1.5x the inner-product throughput of the pack4 strategy when it bottlenecks on L2 bandwidth, or 3/4 the throughput for data hot in L1D, in theory, counting just the inner loop. This is just the inner-product part, not accounting for different pack/unpack costs.
  • hard to transpose (but maybe not horrible with pmovmskb to extract 1 bit from each byte and make them contiguous).

6 elements per bytes, 0xxx0xxx (probably no advantages for this problem on HSW/SKL, but interesting to consider):

  • 6x the density of separate bytes
  • fairly easy to create from 0/1 bytes in an interleaved way, by shifting/ORing, same as the 4 bits per byte format.
  • optimized for efficient popcount with AVX2 vpshufb. No need to mask before 2xvpshufb, just 1 right-shift. (vpshufb zeros the byte if the high bit is set, otherwise it uses the low nibble as an index. This is why it needs the masking.) Right shifting this format by 4 (vpsrld ymm0,4) will still leave a zero in the high bit of every byte. Load+AND -> accumulate popcount is 7 fused-domain uops per vector (vmovdqa/vpand ymm,[mem]/vpsrld ymm,4/2xvpshufb/2xvpaddb), only 6 of which need ALU ports. So HSW/SKL throughput is in theory 1 vector (of 192 elements) per 2 clocks, or 96 elements per clock. This requires an average load throughput of one 256b vector per clock, so it's right up against the L2 bandwidth bottleneck.

    In theory it's the same as fully packed, but in practice it might be slightly faster or slower depending on which one schedules better (fewer AND/ADD uops stealing port 5 from shuffles, for example). Fully packed is probably more likely to come close to theoretical speed, because more of its uops can run on multiple ports. Out-of-order scheduling imperfections are less likely.

  • The pmovmskb transpose trick doesn't work cleanly.
  • Could be useful if we just needed popcount(A[]) instead of popcount(A[] & B[]). Or for a different microarchitecture where ALU vs. load throughput was different.

Another variation on this, 7 elements per byte could be popcounted with a single AVX512VBMI (Cannonlake?) vpermi2b (_mm512_permutex2var_epi8), where each index byte selects one of 128 bytes from the concatenation of two other registers. A shuffle that wide will probably be slow, but it will hopefully have better throughput than an AVX512 vpshufb separate-nibble thing.

To count packed-8 with AVX512VBMI (but without AVX512VPOPCNTDQ), you could maybe use vpermi2b to count the low 7, then shift+mask the top bit and just add it. (popcount of a single bit = that bit).


uint8_t elements are easier to shuffle efficiently (since there are byte shuffles like vpshufb), so may be worth considering if you have to transpose on the fly. Or only pack down to bits on the fly while transposing?

32-bit integers are also an option, but not a good option. Fewer elements per vector means fewer shuffle instructions in a transpose, but not by a factor of 4. The number of shuffles in a transpose may scale with something like log2(elements per vector).

This is also a big deal for cache footprint / memory bandwidth. The factor of 8 size difference can mean that doing a whole row or column only takes part of L1, instead of overflowing L1. So it can make cache-blocking easier / less important.

10k * 20k / 8 = 23.84MiB per matrix, using packed-bit elements. That's much larger than L2 cache (256kiB on Haswell, 1MiB on Skylake-AVX512), but will fit in L3 on many-core Xeon CPUs. But L3 is competitively shared by all cores (including other VMs in a cloud environment), and is much slower than L2. (Many-core Xeons like you will be running on in HPC / cloud systems have lower per-core memory bandwidth than quad-core desktops, because of higher latency to L3 cache with no increase in concurrency (see the "latency-bound platforms" section of this answer. It takes more cores to drive the same amount of memory bandwidth on a Xeon, even though the total throughput is higher. But if you can have each core mostly working out of its private L2, you gain a LOT.)


Adding up the AND results: You've arranged your loops so you need to reduce a single run of booleans to a count of the non-zeros. This is a good thing.

With 8-bit integer 0/1 elements, you can do up to 255 vpaddb before an element could overflow. It has good throughput: 2 per clock on Haswell, 3 per clock on Skylake. With multiple accumulators, that covers a lot of vectors of AND results. Use vpsadbw against an all-zero vector to horizontally add the bytes in a vector into 64-bit integers. Then combine your accumulators with vpaddq, then horizontally sum it.

With packed bits, you just want to popcount the vectors of AND results. With AVX2 and your data already in vectors, you definitely want to use a VPSHUFB-based bit-slicing popcount. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)

You could consider packing your data 4 bits per byte, in the low nibble. That would mean one vpshufb could count the bits in each byte of an AND result, without needing any shifting / masking. Inside the inner loop, you'd have 2 loads, vpand, vpshufb, vpaddb. With proper unrolling, that should keep up with L1D load bandwidth of 2x 32B per clock, and saturate all three vector execution ports (on Haswell or Skylake). Break out of that every 128 or 255 vectors or something to accumulate the bytes of your accumulator(s) with vpsadbw/vpaddq. (But with cache-blocking, you probably want to break out that often anyway and do different work). So the inner-most loop should run at 4 elements per byte * 32B per vector = 128 elements per clock cycle, if you can arrange for it to read data that's hot in L1D cache. Expect about half that bandwidth from L2 cache on Haswell/Skylake, or much worse from L3 cache.

With uint8_t elements that are 0 or 1, you can maybe use some integer multiply-add instructions. They're a bit weirdly designed, intended for different use-cases than FP FMA. They add horizontal pairs of multiply results, producing wider elements. VPMADDUBSW widens from 8 to 16 bit elements, and would work well on 0s and 1s. Since each element can only be in the range 0..2, you can still horizontal-sum with vpsadbw. But if you're going to vpsadbw, this gains you nothing over vpand. It would only be useful if you wanted to use vpaddw to use 16-bit elements in your vector accumulator, instead of breaking out of a loop to avoid byte overflow. vpmaddubsw doesn't seem useful here, becausevpsadbw` is a better way to horizontally add bytes.


Converting 0/1 integers to bitmaps can be done efficiently with SSE/AVX: For 32-bit integer elements, vpslld ymm0, 31 to left-shift the relevant bit to the top of each element, then vmovmskps eax, ymm0 to get an 8-bit mask of the high byte of each 32-bit element. For 8-bit integer elements, vpslld ymm0, 7 / vpmovmskb eax, ymm0 to do the same thing but for each byte, producing a 32-bit integer bitmap result. (Only the sign bit of each byte matters, so it's fine that there are no shift instructions with only 8 bit granularity. You don't need to do anything about the bits that carry into the next element.)

This isn't a very good method for using right away with vectors, because you end up with the results in integer registers. This isn't a great format to generate and use on the fly, but it is the most compact so it can make sense if you can keep matrices in this format long-term. (And if you'll be limited by memory bandwidth when loading them.)

Converting 32-bit integers to 8-bit: One ways is with 2x vpackssdw + vpacksswb. Because those operate within the 128b lanes, your elements will end up reordered. But that's ok as long as it's the same ordering for every row/column. It's only a problem if you want to take a chunk of a row/column that doesn't start at a multiple of 32 elements. Another option here is to left-shift (by 8, by 16, and by 24), and OR vectors together. Actually, you can do the shifting for free by using an unaligned load offset by 1, 2, or 3 bytes.

static inline
__m256i load_interleave4x32(const int32_t *input) {
  const char *p = (const char*)input;
  __m256i t0 = _mm256_load_si256((const __m256i*)(p));
  __m256i t1 = _mm256_load_si256((const __m256i*)(p+32*1-1));  // the 1/0 bits will be in the 2nd byte of each 32-bit element
  __m256i t2 = _mm256_load_si256((const __m256i*)(p+32*2-2));
  __m256i t3 = _mm256_load_si256((const __m256i*)(p+32*3-3));
  return t0 | t1 | t2 | t3;
  // or write this out with _mm256_or_si256, if you don't have overloaded operators like GNU C does.
  // this should compile to 1 load and 3 vpor ymm0, [rdi+31] ... instructions.
}

Converting to half-packed 4 bits per byte: we can use the same idea as above. Get 4 vectors from load_interleave4x32 (or from an array of uint8_t if you started with 8-bit elements). Left-shift them by 0, 1, 2, and 3 bits, and OR those all together. This interleaved bit-order is fine when we just need to AND a row/column and popcount the whole result, because order doesn't matter. This bit-order is fairly efficient to unpack back to in-order bytes, e.g. AND with set1_epi8(1) will get you a vector of bytes.

You might use this as part of a transpose if you store your whole matrices in this format, or you could use this format to store temporary copies for a cache-blocked transpose. A matmul touches each row/column multiple times, so it may be worth doing extra work to make a compact format the first time when that lets you do 4x as much work per vector on subsequent passes.


With AVX512BW (Skylake-AVX512)

We really want to be doing the AND and popcnt with vectors, not with scalar integer, because the vectors are twice as wide as AVX2, so they pull further ahead of scalar popcnt. (Even though Skylake-AVX512 shuts down the vector ALUs (but not scalar) on port 1 while running 512b instructions).

@Harold points out an interesting identity that lets us do 2/3rds as many vector popcounts, at the cost of extra integer operations.

   popcnt(a) + popcnt(b) + popcnt(c)
 = popcnt(a ^ b ^ c) + 2 * popcnt((a ^ b) & c | (a & b))

a ^ b ^ c and (a ^ b) & c | (a & b) can be done with one vpternlogd each (since each have 3 boolean inputs). The 2* is free if we use a separate pre-shifted vpshufb LUT vector. See also this implementation that uses 30x vpternlogd + 1 vector popcnt to handle 16 vectors of 512b, with some cleanup at the end (only doing the 16*popcnt counts inside the loop; everything else is chained).

This is very likely worth it for counting fully-packed 8 bits per byte elements, and makes that format a lot more attractive for AVX512, compared to less-dense formats optimized for popcounting without as much shifting/masking.

vpternlogd can also be useful as a bit-blend instruction for transposes, if byte-granularity VPBLENDMB zmm{k1}, zmm, zmm isn't fine-enough grained.

This might be worth it for AVX2 on some CPUs, maybe avoiding 1 out of every 4 or 5 vector popcounts rather than 1 of 3? Or it might not help at all if it just increases the total execution port pressure, and there wasn't a bottleneck on any specific one. It would be useful with scalar popcnt instructions (maybe on CPUs without AVX2), because those do bottleneck on a single port on Intel CPUs.


We can turn uint8_t boolean elements into non-interleaved bitmaps slightly more efficiently than AVX2 (without even needing a shift), and do the reverse much more efficiently. Test-into-mask or compare-into-mask against a vector of set1_epi8(1) would both do the job, producing 64 bits of mask from 64 bytes of input. Or with 32-bit integers to start with, producing 16 bits of mask at a time. You can efficiently concatenate those bits with kunpck instructions.

_mm512_test_epi8_mask (vptestmb) is interesting: AND two vectors together, and produce a mask-register result of byte-elements that were true/false. But this isn't really what we want: if we're going to pack our bits, we want to do it as a pre-processing step on the input matrices, not on the fly while doing inner products.

bitmap -> vector of 0 / -1 is fast: __m512i _mm512_movm_epi8 (__mmask64 k) (vpmovm2b) does that in one instruction. You can subtract -1 instead of adding 1, but you'd have to mask it before you can OR together multiple bits within a byte.

Without AVX512BW or AVX512DQ (Knight's Landing Xeon Phi), you don't have 512b vpshufb so you can't vector popcnt as efficiently. There's an AVX512 popcnt extension for vector popcnt directly, but no hardware with it has even been announced yet. (AVX2 vpshufb ymm is very slow on KNL, though: one per 12 cycles, and psadbw ymm is 1 per 9 cycles, so even using 256b vectors is unattractive). You might use a bithack popcnt based on 32-bit integer elements, since that's just AND/shift/ADD. 32-bit elements will take fewer steps to popcnt than 64-bit, but are still big enough not to overflow for reasonable problem sizes (so you can defer a horizontal sum of the vector until outside a loop)

Given the choice of storage format, packing multiple bits per byte might not be a good idea for KNL, but single-byte integer elements are good. vpandd zmm and vpaddd zmm are both fast and part of AVX512F, and we can use them because we don't want to let our single-bytes overflow anyway. (Using a packed 32-bit add when we actually have 8-bit elements that won't carry into each other is a SWAR technique.) KNL has good memory bandwidth and poor instruction throughput relative to Skylake-AVX512, I think.


Transposing bits:

BMI2 _pdep_u64 might be useful here. It's a scalar instruction/intrinsic. If it makes bit-transpose a lot more efficient than unpacking to bytes, you'd probably want to store a block of transpose results before reloading it with vector loads for AND + count. (Reloading a vector right away after scalar stores will cause a store-forwarding stall.)

Another useful option is that vpmovmskb can slice 32 bits out of a 32-byte vector, one per byte. This gives you a building block for a transpose, maybe combined with byte shuffles to get the bytes in the right order for it. For more, see this blog post, and also How would you transpose a binary matrix?.


Using this in a matmul

Some of your choices depend on what format your input data is in, and how often you will reuse the same matrices. If a matrix will be used multiple times, packing it down to 4 or 8 bits per byte ahead of time makes sense. (Or on the fly the first time it's used). Keeping a transposed copy of it may also make sense, especially if it will always be the side of the multiply that needs transposing. (If you sometimes need one way and sometimes the other, redoing on the fly might be better for L3 cache footprint. But these are big enough that you probably won't get a lot of L3 hits, so just keeping a transposed copy could be good.)

Or maybe even write out a transposed and non-transposed version while converting from your input format.

You will definitely want to cache-block the multiplies, so the same data is reused multiple times while hot in L1. I don't have anything useful to say about this off the top of my head. The same principles apply as when cache-blocking a normal FP matmul, so go read about that.


Comments on your C++ implementation:

Using a bitset & for a whole column will put the values back in memory, and then you'll loop over them again in .count() on the result. I doubt that the compiler will optimize this into a one-pass loop that uses a VPSHUFB-based bit-slicing popcnt on each vector of VPAND results, but that would be much better. (See http://wm.ite.pl/articles/sse-popcount.html for example. You'd want to write it with intrinsics, not asm, if you have to manually vectorize it.)

With your matrix sizes, at least that inner loop probably hits in L1D cache, but the extra load/store instructions from looping twice are more overhead and it also interferes with prefetch of the valuable data.


Getting compilers to efficiently popcnt a dynamically-sized bitmap (without manually vectorizing) is not easy. The only thing that doesn't suck is clang++ -stdlib=libc++ with vector<bool>, which compiles std::count(v.begin(), v.end(), true); to a vectorized vpshufb + vpsadbw + vpaddq loop, which is quite good. It would be faster if it just used vpaddb inside the unrolled loop and vpsadbw + vpaddq once per iteration, but it's pretty good for auto-vectorized code.

g++'s vector<bool> is also a bitmap, but std::count(v.begin(), v.end(), true); is very bad: it uses a totally naive loop that tests 1 bit at a time. And it doesn't even do that efficiently. Same for clang++ with the default libstdc++ instead of the new libc++.

boost::dynamic_bitset has a .count() member function, but it doesn't take advantage of the popcnt instruction or AVX2. It does a byte-at-a-time LUT lookup. That's much better than std::count(vector<bool>) without libc++, but it's not even close to good enough for HPC.

Here's the test code on the Godbolt compiler explorer, with gcc and clang asm output. All of them used -march=haswell.

But unfortunately, there doesn't seem to be an efficient way to bitwise-AND two std::vector<bool>. This answer shows how to get at the underlying implementation of g++'s libstdc++ vector<bool>, but that code doesn't auto-vectorize. Doing the same thing for libc++ and tweaking it so it auto-vectorizes might let you get a good fraction of the performance possible with manual vectorization (except for transpose), but you'd probably have to keep your whole matrix in one vector<bool>, because a vector of vectors is a bad extra level of indirection. If the transpose part of the problem is performance-critical too, using standard containers to get access to an efficient popcount is not going to solve the whole problem.

For std::bitset<1024*1024>.count(), clang makes the same efficient AVX2 popcount with or without libc++. g++ makes a scalar loop using the 64-bit popcnt instruction, which (according to this) is somewhat faster than a good AVX2 popcnt for small bitsets, but somewhat slower for large bitsets, on Haswell and Skylake.

See also: On vector<bool> — Howard Hinnant, for some commentary on the C++ standard library, and why an array of bits is a useful data structure, but vector<bool> is a bad name for it. Also, some benchmarks for properly-optimized count/find_first/etc. on a bit-vector vs. a 1 bool-per-byte bool[] array, vs. a naive vector<bool> (like you get from gcc and clang without libc++).