How to quickly count bits into separate bins in a series of ints on Sandy Bridge? [duplicate]

Question

Update: Please read the code, it is NOT about counting bits in one int

Is it possible to improve performance of the following code with some clever assembler?

uint bit_counter[64];

void Count(uint64 bits) {
  bit_counter[0] += (bits >> 0) & 1;
  bit_counter[1] += (bits >> 1) & 1;
  // ..
  bit_counter[63] += (bits >> 63) & 1;
}

Count is in the inner-most loop of my algorithm.

Update: Architecture: x86-64, Sandy Bridge, so SSE4.2, AVX1 and older tech can be used, but not AVX2 or BMI1/2.

bits variable has almost random bits (close to half zeros and half ones)

You can, at the very least, make that Count function prettier: for(int i=0; i < 64; ++i) bit_counter[i] += (bits >> i) & 1;. — Xeo
How wide do the bit counters have to be? Must they be uints or is narrower allowed? — harold
@Lukasz: Are you sure? I'd leave the loop unrolling to the compiler. Since the bounds of the for loop are compile time constants, I'm pretty sure that can and will be optimized by a good compiler. — Xeo
@Xeo: Time it. Prettier (in this case) is significantly slower. Łukasz Lew's manual loop unrolling is faster -- at least with gcc 4.2 and LLVM — the wolf

Paul R Paul R · Accepted Answer · 2011-10-17T13:44:58

You could try doing it with SSE, incrementing 4 elements per iteration.

Warning: untested code follows...

#include <stdint.h>
#include <emmintrin.h>

uint32_t bit_counter[64] __attribute__ ((aligned(16)));
                     // make sure bit_counter array is 16 byte aligned for SSE

void Count_SSE(uint64 bits)
{
    const __m128i inc_table[16] = {
        _mm_set_epi32(0, 0, 0, 0),
        _mm_set_epi32(0, 0, 0, 1),
        _mm_set_epi32(0, 0, 1, 0),
        _mm_set_epi32(0, 0, 1, 1),
        _mm_set_epi32(0, 1, 0, 0),
        _mm_set_epi32(0, 1, 0, 1),
        _mm_set_epi32(0, 1, 1, 0),
        _mm_set_epi32(0, 1, 1, 1),
        _mm_set_epi32(1, 0, 0, 0),
        _mm_set_epi32(1, 0, 0, 1),
        _mm_set_epi32(1, 0, 1, 0),
        _mm_set_epi32(1, 0, 1, 1),
        _mm_set_epi32(1, 1, 0, 0),
        _mm_set_epi32(1, 1, 0, 1),
        _mm_set_epi32(1, 1, 1, 0),
        _mm_set_epi32(1, 1, 1, 1)
    };

    for (int i = 0; i < 64; i += 4)
    {
        __m128i vbit_counter = _mm_load_si128(&bit_counter[i]);
                                          // load 4 ints from bit_counter
        int index = (bits >> i) & 15;     // get next 4 bits
        __m128i vinc = inc_table[index];  // look up 4 increments from LUT
        vbit_counter = _mm_add_epi32(vbit_counter, vinc);
                                          // increment 4 elements of bit_counter
        _mm_store_si128(&bit_counter[i], vbit_counter);
    }                                     // store 4 updated ints
}

How it works: essentially all we are doing here is vectorizing the original loop so that we process 4 bits per loop iteration instead of 1. So we now have 16 loop iterations instead of 64. For each iteration we load 4 bits from bits, then use them as an index into a LUT which contains all possible combinations of 4 increments for the current 4 bits. We then add these 4 increments to the current 4 elements of bit_counter.

The number of loads and stores and adds is reduced by a factor of 4, but this will be offset somewhat by the LUT load and other housekeeping. You may still see a 2x speed up though. I'd be interested to know the result if you do decide to try it.

How to quickly count bits into separate bins in a series of ints on Sandy Bridge? [duplicate]

9 Answers