Background
I've recently been taking some old code (~1998) and rewriting parts of it to improve performance. Previously, the basic data structure for a state stored its elements in several arrays; now I'm using raw bits (for the cases that require fewer than 64 bits). That is, before I had an array of b elements, and now I have b bits set in a single 64-bit integer that indicate whether each value is part of my state.
Using intrinsics like _pext_u64 and _pdep_u64 I've managed to get all operations 5-10x faster. I am working on the last operation, which has to do with computing a perfect hash function.
The exact details of the hash function aren't too important, but it boils down to computing binomial coefficients (n choose k, i.e. n!/((n-k)!k!)) for various n and k. My current code uses a large lookup table for this, which is probably hard to speed up significantly on its own (except for possible cache misses in the table, which I haven't measured).
But, I was thinking that with SIMD instructions I might be able to directly compute these for several states in parallel, and thus see an overall performance boost.
Some constraints:
- There are always exactly b bits set in each 64-bit state (representing small numbers).
- The k value in the binomial coefficients is related to b and changes uniformly in the calculation. These values are small (most of the time <= 5).
- The final hash will be < 15 million (easily fits in 32 bits).
So, I can fairly easily write out the math for doing this in parallel and for keeping all operations as integer multiple/divide without remainders while keeping within 32 bits. The overall flow is:
- Extract the bits into values suitable for SIMD instructions.
- Perform the n choose k computation in a way that avoids overflow.
- Extract the final hash value from each entry.
But, I haven't written SIMD code before, so I'm still getting up to speed on all the functions available and their caveats/efficiencies.
Example:
Previously I would have had my data in an array, supposing there are always 5 elements:
[3 7 19 31 38]
Now I'm using a single 64-bit value for this:
0x4080080088
This makes many other operations very efficient. For the perfect hash I need to compute something like this efficiently (using c for choose):
(50c5)-(38c5) + (37c4)-(31c4) + (30c3)-(19c3) + ...
But, in practice I have a bunch of these to compute, just with slightly different values:
(50c5)-(Xc5) + ((X-1)c4)-(Yc4) + ((Y-1)c3)-(Zc3) + ...
All the X/Y/Z... will be different but the form of the calculation is identical for each.
Questions:
Is my intuition on gaining efficiency by converting to SIMD operations reasonable? (Some sources suggest "no", but that's the problem of computing a single coefficient, not doing several in parallel.)
Is there something more efficient than repeated _tzcnt_u64 calls for extracting bits into the data structures for SIMD operations? (For instance, I could temporarily break my 64-bit state representation into 32-bit chunks if it would help, but then I wouldn't be guaranteed to have the same number of bits set in each element.)
What are the best intrinsics for computing several sequential multiply/divide operations for the binomial coefficients when I know there won't be overflow? (When I look through the Intel references I have trouble interpreting the naming quickly when going through all the variants; it isn't clear that what I want is available.)
If directly computing the coefficients is unlikely to be efficient, can SIMD instructions be used for parallel lookups into my previous lookup table of coefficients?
(I apologize for putting several questions together, but given the specific context, I thought it would be better to put them together as one.)
Comments:
- …__m128i variable. If you need wider intermediate precision then yes, the first step is often something like pmovzxbd or something (_mm256_cvtepu8_epi32) – Peter Cordes
- If k is always small then the divisors are effectively constants. – Nate Eldredge