You could try doing it with SSE, incrementing 4 elements per iteration.
Warning: untested code follows...
    #include <emmintrin.h>  // SSE2 intrinsics
    #include <stdint.h>

    // make sure bit_counter array is 16 byte aligned for SSE
    uint32_t bit_counter[64] __attribute__ ((aligned(16)));

    void Count_SSE(uint64_t bits)
    {
        // _mm_set_epi32 takes its arguments from the highest lane to the lowest,
        // so bit k of the table index ends up in lane k
        const __m128i inc_table[16] = {
            _mm_set_epi32(0, 0, 0, 0),
            _mm_set_epi32(0, 0, 0, 1),
            _mm_set_epi32(0, 0, 1, 0),
            _mm_set_epi32(0, 0, 1, 1),
            _mm_set_epi32(0, 1, 0, 0),
            _mm_set_epi32(0, 1, 0, 1),
            _mm_set_epi32(0, 1, 1, 0),
            _mm_set_epi32(0, 1, 1, 1),
            _mm_set_epi32(1, 0, 0, 0),
            _mm_set_epi32(1, 0, 0, 1),
            _mm_set_epi32(1, 0, 1, 0),
            _mm_set_epi32(1, 0, 1, 1),
            _mm_set_epi32(1, 1, 0, 0),
            _mm_set_epi32(1, 1, 0, 1),
            _mm_set_epi32(1, 1, 1, 0),
            _mm_set_epi32(1, 1, 1, 1)
        };

        for (int i = 0; i < 64; i += 4)
        {
            // load 4 ints from bit_counter
            __m128i vbit_counter = _mm_load_si128((const __m128i *)&bit_counter[i]);
            int index = (bits >> i) & 15;      // get next 4 bits
            __m128i vinc = inc_table[index];   // look up 4 increments from LUT
            // increment 4 elements of bit_counter
            vbit_counter = _mm_add_epi32(vbit_counter, vinc);
            // store 4 updated ints
            _mm_store_si128((__m128i *)&bit_counter[i], vbit_counter);
        }
    }
How it works: essentially all we are doing here is vectorizing the original loop so that we process 4 bits per loop iteration instead of 1. So we now have 16 loop iterations instead of 64. For each iteration we load 4 bits from bits, then use them as an index into a LUT which contains all possible combinations of 4 increments for the current 4 bits. We then add these 4 increments to the current 4 elements of bit_counter.
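To make the table lookup concrete, here is a small scalar sketch (no intrinsics) of what one iteration does. The names `counters`, `nibble_increments` and `count_nibble_scalar` are made up for illustration and are not part of the code above; the only point is that lane k of table entry `index` equals bit k of `index`.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t counters[64];   /* stand-in for bit_counter, illustration only */

/* Hypothetical helper: expand a 4-bit index into its 4 per-lane increments.
   Lane k gets bit k of index, which is what inc_table[index] holds. */
static void nibble_increments(unsigned index, uint32_t inc[4])
{
    assert(index < 16);
    for (int k = 0; k < 4; ++k)
        inc[k] = (index >> k) & 1u;
}

/* Scalar model of the vectorized loop: 16 iterations, 4 counters each. */
static void count_nibble_scalar(uint64_t bits)
{
    for (int i = 0; i < 64; i += 4) {
        uint32_t inc[4];
        nibble_increments((unsigned)((bits >> i) & 15), inc);
        for (int k = 0; k < 4; ++k)
            counters[i + k] += inc[k];   /* the SSE version does these 4 adds at once */
    }
}
```

For example, index 5 (binary 0101) expands to the increments {1, 0, 1, 0}, so only counters i and i+2 are bumped in that iteration.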
The number of loads, stores and adds is reduced by a factor of 4, but this will be offset somewhat by the LUT load and other housekeeping. You may still see a 2x speedup though. I'd be interested to know the result if you do decide to try it.
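Since the SSE code is untested, it is probably worth checking it against the plain loop before timing anything. The following is only a rough sketch, assuming an x86 target with SSE2 and a GCC-compatible compiler; `count_scalar`, `count_sse`, `results_match` and `time_calls` are hypothetical names, and the LUT here is built in a loop rather than written out by hand.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics; x86 only */
#include <stdint.h>
#include <string.h>
#include <time.h>

static uint32_t ref_counter[64];                              /* scalar result */
static uint32_t sse_counter[64] __attribute__((aligned(16))); /* SSE result */

/* Baseline: one bit per iteration. */
static void count_scalar(uint64_t bits)
{
    for (int i = 0; i < 64; ++i)
        ref_counter[i] += (bits >> i) & 1;
}

/* Compact copy of the SSE version; the LUT is filled in on first use. */
static void count_sse(uint64_t bits)
{
    static __m128i lut[16];
    static int ready = 0;
    if (!ready) {
        for (int n = 0; n < 16; ++n)   /* lane k = bit k of n */
            lut[n] = _mm_set_epi32((n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1);
        ready = 1;
    }
    for (int i = 0; i < 64; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)&sse_counter[i]);
        v = _mm_add_epi32(v, lut[(bits >> i) & 15]);
        _mm_store_si128((__m128i *)&sse_counter[i], v);
    }
}

/* Feed both versions the same pseudo-random words;
   return 1 if all 64 counters agree afterwards. */
static int results_match(int rounds)
{
    uint64_t x = 0x9E3779B97F4A7C15ULL;   /* arbitrary non-zero seed */
    assert(rounds > 0);
    memset(ref_counter, 0, sizeof ref_counter);
    memset(sse_counter, 0, sizeof sse_counter);
    for (int r = 0; r < rounds; ++r) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;   /* xorshift-style step */
        count_scalar(x);
        count_sse(x);
    }
    return memcmp(ref_counter, sse_counter, sizeof ref_counter) == 0;
}

/* CPU seconds spent on `calls` invocations of f. */
static double time_calls(void (*f)(uint64_t), long calls)
{
    clock_t t0 = clock();
    for (long c = 0; c < calls; ++c)
        f((uint64_t)c * 0x9E3779B97F4A7C15ULL);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Once `results_match` returns 1, comparing `time_calls(count_scalar, n)` with `time_calls(count_sse, n)` for a large `n` in an optimized build (e.g. `-O2`) gives a rough idea of the speedup; the ratio will depend on the compiler and CPU.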
Count function prettier: for(int i=0; i < 64; ++i) bit_counter[i] += (bits >> i) & 1; – Xeo
gcc 4.2 and LLVM – the wolf