2
votes

Background

I've recently been taking some old code (~1998) and re-writing some of it to improve performance. Previously, in the basic data structures for a state, I stored elements in several arrays; now I'm using raw bits (for the cases that require fewer than 64 bits). That is, before I had an array of b elements, and now I have b bits set in a single 64-bit integer that indicate whether that value is part of my state.

Using intrinsics like _pext_u64 and _pdep_u64 I've managed to get all operations 5-10x faster. I am working on the last operation, which has to do with computing a perfect hash function.

The exact details of the hash function aren't too important, but it boils down to computing binomial coefficients (n choose k, i.e. n!/((n-k)!k!)) for various n and k. My current code uses a large lookup table for this, which is probably hard to speed up significantly on its own (except for possible cache misses in the table, which I haven't measured).
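For reference, such a table is typically built with Pascal's rule. A minimal sketch, with bounds and names of my own choosing (the question implies n up to roughly 50 and small k):

```cpp
#include <cstdint>
#include <cassert>
#include <vector>

// Hypothetical bounds: n up to 50, k up to 5 (adjust as needed).
constexpr int MAX_N = 51;
constexpr int MAX_K = 6;

// table[n][k] == C(n, k), built with Pascal's rule:
//   C(n, k) = C(n-1, k-1) + C(n-1, k)
// Entries with k > n stay 0, which matches the convention C(n, k) = 0 there.
std::vector<std::vector<uint32_t>> BuildBinomialTable() {
    std::vector<std::vector<uint32_t>> c(MAX_N, std::vector<uint32_t>(MAX_K, 0));
    for (int n = 0; n < MAX_N; n++) {
        c[n][0] = 1;  // C(n, 0) = 1
        for (int k = 1; k < MAX_K && k <= n; k++)
            c[n][k] = c[n-1][k-1] + c[n-1][k];
    }
    return c;
}
```

With these bounds the largest entry is C(50, 5) = 2,118,760, which comfortably fits in 32 bits, consistent with the hash staying under 15 million.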

But, I was thinking that with SIMD instructions I might be able to directly compute these for several states in parallel, and thus see an overall performance boost.

Some constraints:

  • There are always exactly b bits set in each 64-bit state (representing small numbers).
  • The k value in the binomial coefficients is related to b and changes uniformly in the calculation. These values are small (most of the time <= 5).
  • The final hash will be < 15 million (easily fits in 32 bits).

So, I can fairly easily write out the math for doing this in parallel, keeping all operations as integer multiplies/divides without remainders while staying within 32 bits. The overall flow is:

  1. Extract the bits into values suitable for SIMD instructions.
  2. Perform the n choose k computation in a way to avoid overflow.
  3. Extract out the final hash value from each entry.
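Step 1 can be sketched with a scalar bit-scan loop. I use the portable `__builtin_ctzll` here; `_tzcnt_u64` plus `_blsr_u64` is the intrinsic equivalent. The function name is illustrative only:

```cpp
#include <cstdint>
#include <cassert>

// Extract the positions of the set bits of `state` into `out`,
// returning the count. `out` should have room for up to 64 entries.
// __builtin_ctzll counts trailing zeros like _tzcnt_u64 (but is
// undefined for 0, so we test for an empty state first).
int ExtractBitPositions(uint64_t state, int32_t *out) {
    int count = 0;
    while (state != 0) {
        out[count++] = __builtin_ctzll(state);  // index of lowest set bit
        state &= state - 1;                     // clear it (what _blsr_u64 does)
    }
    return count;
}
```

The resulting `int32_t` array can then be loaded directly with `_mm256_loadu_si256` for the SIMD stage.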

But, I haven't written SIMD code before, so I'm still getting up to speed on all the functions available and their caveats/efficiencies.

Example:

Previously I would have had my data in an array, supposing there are always 5 elements:

[3 7 19 31 38]

Now I'm using a single 64-bit value for this:

0x4080080088
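For concreteness, assuming element v maps to bit v of the 64-bit word (my assumption; a different bit ordering would give a different constant), the conversion from the array form is just OR-ing in one bit per element:

```cpp
#include <cstdint>
#include <cassert>

// Pack an array of element values (each < 64) into the single-word
// representation: bit v is set iff v is in the state.
uint64_t PackState(const int32_t *elems, int count) {
    uint64_t state = 0;
    for (int i = 0; i < count; i++)
        state |= 1ULL << elems[i];
    return state;
}
```

Under this mapping, [3 7 19 31 38] packs to 0x4080080088.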

This makes many other operations very efficient. For the perfect hash I need to compute something like this efficiently (using c for choose):

(50c5)-(38c5) + (37c4)-(31c4) + (30c3)-(19c3) + ...

But, in practice I have a bunch of these to compute, just with slightly different values:

(50c5)-(Xc5) + ((X-1)c4)-(Yc4) + ((Y-1)c3)-(Zc3) + ...

All the X/Y/Z... will be different but the form of the calculation is identical for each.
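To pin down the shape of the computation, here is a scalar sketch of that term pattern. The constant 50, the descending element order, and the continuation of the "..." down to the last term are my assumptions; the real hash presumably differs in details:

```cpp
#include <cstdint>
#include <cassert>

// C(n, k) by the exact-division product formula; after the i-th step the
// running value equals C(n-k+i, i), so each division is exact.
uint64_t Binom(uint64_t n, uint64_t k) {
    if (k > n) return 0;
    uint64_t r = 1;
    for (uint64_t i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

// One pair of terms per element, largest element first:
//   (C(top,b) - C(e0,b)) + (C(e0-1,b-1) - C(e1,b-1)) + ...
// matching (50c5)-(Xc5) + ((X-1)c4)-(Yc4) + ((Y-1)c3)-(Zc3) + ...
uint64_t HashTerms(const int32_t *elems, int count, int top) {
    uint64_t h = 0;
    int n = top;                   // upper argument of the current term
    for (int k = count; k >= 1; k--) {
        int e = elems[count - k];  // elems sorted descending
        h += Binom(n, k) - Binom(e, k);
        n = e - 1;                 // next term starts just below e
    }
    return h;
}
```

Because the elements are strictly decreasing, each pair satisfies n >= e, so no term goes negative.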

Questions:

  1. Is my intuition about gaining efficiency by converting to SIMD operations reasonable? (Some sources suggest "no", but those address the problem of computing a single coefficient, not doing several in parallel.)

  2. Is there something more efficient than repeated _tzcnt_u64 calls for extracting bits into the data structures for SIMD operations? (For instance, I could temporarily break my 64-bit state representation into 32-bit chunks if it would help, but then I wouldn't be guaranteed to have the same number of bits set in each element.)

  3. What are the best intrinsics for computing several sequential multiply/divide operations for the binomial coefficients when I know there won't be overflow? (When I look through the Intel references I have trouble quickly interpreting the naming across all the variants; it isn't clear whether what I want is available.)

  4. If directly computing the coefficients is unlikely to be efficient, can SIMD instructions be used for parallel lookups into my previous lookup table of coefficients?

(I apologize for putting several questions together, but given the specific context, I thought it would be better to put them together as one.)

1
Can we assume AVX2 (and hence the availability of gathered loads)? – Paul R
Is using a different hash function an option? SIMD integer division is not available on x86, except via multiplicative inverses (efficient for constant divisors) or conversion to/from float or double. – Peter Cordes
"Extract the bits into values suitable for SIMD instructions." This is the wrong way to think about SIMD. When you load a 64-bit integer into a SIMD vector, it already is a vector of 8x 8-bit integers, and of 4x 16-bit integers, and so on. You can use any element-width instructions you want on a __m128i variable. If you need wider intermediate precision then yes, the first step is often something like pmovzxbd or something (_mm256_cvtepu8_epi32). – Peter Cordes
If k is always small then the divisors are effectively constants. – Nate Eldredge
Or are you saying the values are variable-length groups of bits that you need to parse iteratively to find out where one ends and the next one starts? Then yes, you might need a scalar loop. I think at least some (pseudo)code for at least a scalar version would help; I'm really not grokking what operations you need to speed up. Probably libdivide.com can help for 16- or 32-bit integer SIMD division by small constants. (Same method as "Why does GCC use multiplication by a strange number in implementing integer division?") – Peter Cordes

1 Answer

0
votes

Here is one possible solution that does the lookup-table computation one state at a time. It would probably be more efficient to process several states in parallel instead of a single state. Note: this is hard-coded for the fixed case of combinations of 6 elements.

int64_t GetPerfectHash2(State &s)
{
    // Requires <immintrin.h>; boardSize, numItemsSet, and the flat
    // `combinations` coefficient table (int32_t[]) are defined elsewhere.
    // 6 values are used; lanes 6-7 of each vector are unused padding.
    __m256i offsetsm1 = _mm256_setr_epi32(6*boardSize-1, 5*boardSize-1,
                                          4*boardSize-1, 3*boardSize-1,
                                          2*boardSize-1, 1*boardSize-1, 0, 0);
    __m256i offsetsm2 = _mm256_setr_epi32(6*boardSize-2, 5*boardSize-2,
                                          4*boardSize-2, 3*boardSize-2,
                                          2*boardSize-2, 1*boardSize-2, 0, 0);
    int32_t index[9];
    uint64_t value = _pext_u64(s.index2, ~s.index1);
    index[0] = boardSize-numItemsSet+1;
    for (int x = 1; x < 7; x++)
    {
        // Position of the next set bit, then clear it for the next pass
        index[x] = boardSize-numItemsSet-_tzcnt_u64(value);
        value = _blsr_u64(value);
    }
    index[8] = index[7] = 0; // padding so the 8-lane loads below are defined

    // Load values and compute the indices into the table
    __m256i firstLookup  = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)&index[0]), offsetsm2);
    __m256i secondLookup = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)&index[1]), offsetsm1);
    // Gather both sets of coefficients from the table (scale = 4 bytes)
    __m256i values1 = _mm256_i32gather_epi32(combinations, firstLookup, 4);
    __m256i values2 = _mm256_i32gather_epi32(combinations, secondLookup, 4);
    // Subtract the paired terms
    __m256i finalValues = _mm256_sub_epi32(values1, values2);
    _mm256_storeu_si256((__m256i*)index, finalValues);

    // Sum the 6 meaningful lanes for the final hash
    int64_t result = 0;
    for (int x = 0; x < 6; x++)
    {
        result += index[x];
    }
    return result;
}

Note that I actually have two similar cases. In the first case I don't need the _pext_u64, and this code is ~3x slower than my existing code. In the second case I do need it, and this version is 25% faster.