3
votes

I'm looking for a hash function that partitions a large set of input data with good uniformity to a small number of partitions (say 100 or 256). That means I expect a lot of collisions and I don't care about collisions.

The input data is not known in advance. I expect strings with a length between maybe 6 and 100 bytes. The strings may be very badly distributed (e.g. a large part filled with spaces or containing only digits).

CRC algorithms is one of the first ideas that springs into mind. CRC8 has been proposed, but without giving information about its uniformity; for CRC32 apparently the uniformity is not that good.

There are lists of simple or general purpose hash functions, but without telling about their uniformity.

Bob Jenkins has a thorough article on hash functions that return a 32 bit value. I suppose that for a uniformly distributed 32 bit value also all possible 8 bit subsets should be evenly distributed, so there are good candidates. But maybe it's overkill to reduce a 32 bit value to a 8 bit value if there are simpler algorithms for 8 bits?

1
Bernstein's hash wich is on Jenkins' page too is not at all bad, and it's dead simple. I'm using it everywhere when "just some hash" is needed. No issues whatsoever. You can even combine the additive and xor-variant into one if you are afraid it's not "random enough", this will usually pipeline to the same number of cycles. Note by the way that the design rationale of CRC is not primarily to produce a well-distributed hash, but to detect accidential bit flips.Damon
There is no penalty in computing using the machine's native register size (32 bits), most operations are performed after (sign) extending the operands to the native int size anyway. Truncating (or taking modulo something) will be cheap (but not be for free). Not all hashfunctions have enough entropy in the rightmost bits.wildplasser

1 Answers

0
votes

I found the sdbm algorithm to show good uniformity, being quite simple:

        h := 0.
        forEach ch in str {
            h := (h * 65599) + ch;
        }