Popcount of SSE vectors for binary correlation?

Question

I have this simple binary correlation method, It beats table lookup and Hakmem bit twiddling methods by x3-4 and %25 better than GCC's __builtin_popcount (which I think maps to a popcnt instruction when SSE4 is enabled.)

Here is the much simplified code:

int correlation(uint64_t *v1, uint64_t *v2, int size64) {
  __m128i* a = reinterpret_cast<__m128i*>(v1);
  __m128i* b = reinterpret_cast<__m128i*>(v2);
  int count = 0;
  for (int j = 0; j < size64 / 2; ++j, ++a, ++b) {
    union { __m128i s; uint64_t b[2]} x;
    x.s = _mm_xor_si128(*a, *b);
    count += _mm_popcnt_u64(x.b[0]) +_mm_popcnt_u64(x.b[1]);
  }
  return count;
}

I tried unrolling the loop, but I think GCC already automatically does this, so I ended up with same performance. Do you think performance further improved without making the code too complicated? Assume v1 and v2 are of the same size and size is even.

I am happy with its current performance but I was just curious to see if it could be further improved.

Thanks.

Edit: Fixed an error in union and it turned out this error was making this version faster than builtin __builtin_popcount , anyway I modified the code again, it is again slightly faster than builtin now (15%) but I don't think it is worth investing worth time on this. Thanks for all comments and suggestions.

for (int j = 0; j < size64 / 4; ++j, a+=2, b+=2) {
  __m128i x0 = _mm_xor_si128(_mm_load_si128(a), _mm_load_si128(b));
  count += _mm_popcnt_u64(_mm_extract_epi64(x0, 0))
        +_mm_popcnt_u64(_mm_extract_epi64(x0, 1));
  __m128i x1 = _mm_xor_si128(_mm_load_si128(a + 1), _mm_load_si128(b + 1));
  count += _mm_popcnt_u64(_mm_extract_epi64(x1, 0))
        +_mm_popcnt_u64(_mm_extract_epi64(x1, 1));
}

Second Edit: turned out that builtin is the fastest, sigh. especially with -funroll-loops and -fprefetch-loop-arrays args. Something like this:

for (int j = 0; j < size64; ++j) {
  count += __builtin_popcountll(a[j] ^ b[j]);
}

Third Edit:

This is an interesting SSE3 parallel 4 bit lookup algorithm. Idea is from Wojciech Muła, implementation is from Marat Dukhan's answer. Thanks to @Apriori for reminding me this algorithm. Below is the heart of the algorithm, it is very clever, basically counts bits for bytes using a SSE register as a 16 way lookup table and lower nibbles as index of which table cells are selected. Then sums the counts.

static inline __m128i hamming128(__m128i a, __m128i b) {
  static const __m128i popcount_mask = _mm_set1_epi8(0x0F);
  static const __m128i popcount_table = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
  const __m128i x = _mm_xor_si128(a, b);
  const __m128i pcnt0 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(x, popcount_mask));
  const __m128i pcnt1 = _mm_shuffle_epi8(popcount_table, _mm_and_si128(_mm_srli_epi16(x, 4), popcount_mask));
  return _mm_add_epi8(pcnt0, pcnt1);
}

On my tests this version is on par; slightly faster on smaller input, slightly slower on larger ones than using hw popcount. I think this should really shine if it is implemented in AVX. But I don't have time for this, if anyone is up to it would love to hear their results.

These are probably also already done by the compiler. But you could pre-calculate size64/2 and move all allocations (j and x) outside of the loop. Depending on the number of variables whose bits you're counting, you might be able to use (google for) popcount using SSE registers instead of integer registers. As always, you should be checking performance yourself. — inetknght
That union trick is likely to result in bad code. Even if it doesn't go through a memory temporary (but it likely will), it will still be bad. — harold
@inetknght: Thanks, I am only looking for ideas. As you guessed those optimisations are made by compiler and taking union out actually hurts performance. Also, I am already using SSE registers. — mdakin
@harold I actually tried to use __128i variables, _mm_load_si128 to load values and _mm_extract_epi64 to access 64 bit parts of registers. That version ended up worse. — mdakin
If you're willing to delve into assembly (are you really trying to edge that much performance out?), this seems like a nifty article: wm.ite.pl/articles/sse-popcount.html — inetknght

Stefan Stefan · Accepted Answer · 2015-01-01T17:22:20

The problem is that popcnt (which is what __builtin_popcnt compiles to on intel CPU's) operates on the integer registers. This causes the compiler to issue instructions to move data between the SSE and integer registers. I'm not surprised that the non-sse version is faster since the ability to move data between the vector and integer registers is quite limited/slow.

uint64_t count_set_bits(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum = 0;
    for(size_t i = 0; i < count; i++) {
        sum += popcnt(a[i] ^ b[i]);
    }
    return sum;
}

This runs at approx. 2.36 clocks per loop on small data sets (fits in cache). I think it run's slow because of the 'long' dependency chain on sum which restricts the CPU's ability to handle more things out of order. We can improve it by manually pipelining the loop:

uint64_t count_set_bits_2(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum = 0, sum2 = 0;

    for(size_t i = 0; i < count; i+=2) {
        sum  += popcnt(a[i  ] ^ b[i  ]);
        sum2 += popcnt(a[i+1] ^ b[i+1]);
    }
    return sum + sum2;
}

This runs at 1.75 clocks per item. My CPU is an Sandy Bridge model (i7-2820QM fixed @ 2.4Ghz).

How about four-way pipelining? That's 1.65 clocks per item. What about 8-way? 1.57 clocks per item. We can derive that the runtime per item is (1.5n + 0.5) / n where n is the amount of pipelines in our loop. I should note that for some reason 8-way pipelining performs worse than the others when the dataset grows, i have no idea why. The generated code looks okay.

Now if you look carefully there is one xor, one add, one popcnt, and one mov instruction per item. There is also one lea instruction per loop (and one branch and decrement, which i'm ignoring because they're pretty much free).

$LL3@count_set_:
; Line 50
    mov rcx, QWORD PTR [r10+rax-8]
    lea rax, QWORD PTR [rax+32]
    xor rcx, QWORD PTR [rax-40]
    popcnt  rcx, rcx
    add r9, rcx
; Line 51
    mov rcx, QWORD PTR [r10+rax-32]
    xor rcx, QWORD PTR [rax-32]
    popcnt  rcx, rcx
    add r11, rcx
; Line 52
    mov rcx, QWORD PTR [r10+rax-24]
    xor rcx, QWORD PTR [rax-24]
    popcnt  rcx, rcx
    add rbx, rcx
; Line 53
    mov rcx, QWORD PTR [r10+rax-16]
    xor rcx, QWORD PTR [rax-16]
    popcnt  rcx, rcx
    add rdi, rcx
    dec rdx
    jne SHORT $LL3@count_set_

You can check with Agner Fog's optimization manual that an lea is half a clock cycle in throughout and the mov/xor/popcnt/add combo is apparently 1.5 clock cycles, although i don't fully understand why exactly.

Unfortunately, I think we're stuck here. The PEXTRQ instruction is what's usually used to move data from the vector registers to the integer registers and we can fit this instruction and one popcnt instruction neatly in one clock cycle. Add one integer add instruction and our pipeline is at minimum 1.33 cycles long and we still need to add an vector load and xor in there somewhere... If intel offered instructions to move multiple registers between the vector and integer registers at once it would be a different story.

I don't have an AVX2 cpu at hand (xor on 256-bit vector registers is an AVX2 feature), but my vectorized-load implementation performs quite poorly with low data sizes and reached an minimum of 1.97 clock cycles per item.

For reference these are my benchmarks: enter image description here

"pipe 2", "pipe 4" and "pipe 8" are 2, 4 and 8-way pipelined versions of the code shown above. The poor showing of "sse load" appears to be a manifestation of the lzcnt/tzcnt/popcnt false dependency bug which gcc avoided by using the same register for input and output. "sse load 2" follows below:

uint64_t count_set_bits_4sse_load(const uint64_t *a, const uint64_t *b, size_t count)
{
    uint64_t sum1 = 0, sum2 = 0;

    for(size_t i = 0; i < count; i+=4) {
        __m128i tmp = _mm_xor_si128(
                    _mm_load_si128(reinterpret_cast<const __m128i*>(a + i)),
                    _mm_load_si128(reinterpret_cast<const __m128i*>(b + i)));
        sum1 += popcnt(_mm_extract_epi64(tmp, 0));
        sum2 += popcnt(_mm_extract_epi64(tmp, 1));
        tmp = _mm_xor_si128(
                    _mm_load_si128(reinterpret_cast<const __m128i*>(a + i+2)),
                    _mm_load_si128(reinterpret_cast<const __m128i*>(b + i+2)));
        sum1 += popcnt(_mm_extract_epi64(tmp, 0));
        sum2 += popcnt(_mm_extract_epi64(tmp, 1));
    }

    return sum1 + sum2;
}

Popcount of SSE vectors for binary correlation?

2 Answers