113
votes

What integer hash function are good that accepts an integer hash key?

11

11 Answers

171
votes

I found the following algorithm provides a very good statistical distribution. Each input bit affects each output bit with about 50% probability. There are no collisions (each input results in a different output). The algorithm is fast except if the CPU doesn't have a built-in integer multiplication unit. C code, assuming int is 32 bit (for Java, replace >> with >>> and remove unsigned):

unsigned int hash(unsigned int x) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = (x >> 16) ^ x;
    return x;
}

The magic number was calculated using a special multi-threaded test program that ran for many hours, which calculates the avalanche effect (the number of output bits that change if a single input bit is changed; should be nearly 16 on average), independence of output bit changes (output bits should not depend on each other), and the probability of a change in each output bit if any input bit is changed. The calculated values are better than the 32-bit finalizer used by MurmurHash, and nearly as good (not quite) as when using AES. A slight advantage is that the same constant is used twice (it did make it slightly faster the last time I tested, not sure if it's still the case).

You can reverse the process (get the input value from the hash) if you replace the 0x45d9f3b with 0x119de1f3 (the multiplicative inverse):

unsigned int unhash(unsigned int x) {
    x = ((x >> 16) ^ x) * 0x119de1f3;
    x = ((x >> 16) ^ x) * 0x119de1f3;
    x = (x >> 16) ^ x;
    return x;
}

For 64-bit numbers, I suggest to use the following, even thought it might not be the fastest. This one is based on splitmix64, which seems to be based on the blog article Better Bit Mixing (mix 13).

uint64_t hash(uint64_t x) {
    x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
    x = x ^ (x >> 31);
    return x;
}

For Java, use long, add L to the constant, replace >> with >>> and remove unsigned. In this case, reversing is more complicated:

uint64_t unhash(uint64_t x) {
    x = (x ^ (x >> 31) ^ (x >> 62)) * UINT64_C(0x319642b2d24d8ec3);
    x = (x ^ (x >> 27) ^ (x >> 54)) * UINT64_C(0x96de1b173f119089);
    x = x ^ (x >> 30) ^ (x >> 60);
    return x;
}

Update: You may also want to look at the Hash Function Prospector project, where other (possibly better) constants are listed.

52
votes

Knuth's multiplicative method:

hash(i)=i*2654435761 mod 2^32

In general, you should pick a multiplier that is in the order of your hash size (2^32 in the example) and has no common factors with it. This way the hash function covers all your hash space uniformly.

Edit: The biggest disadvantage of this hash function is that it preserves divisibility, so if your integers are all divisible by 2 or by 4 (which is not uncommon), their hashes will be too. This is a problem in hash tables - you can end up with only 1/2 or 1/4 of the buckets being used.

30
votes

Depends on how your data is distributed. For a simple counter, the simplest function

f(i) = i

will be good (I suspect optimal, but I can't prove it).

16
votes

Fast and good hash functions can be composed from fast permutations with lesser qualities, like

  • multiplication with an uneven integer
  • binary rotations
  • xorshift

To yield a hashing function with superior qualities, like demonstrated with PCG for random number generation.

This is in fact also the recipe rrxmrrxmsx_0 and murmur hash are using, knowingly or unknowingly.

I personally found

uint64_t xorshift(const uint64_t& n,int i){
  return n^(n>>i);
}
uint64_t hash(const uint64_t& n){
  uint64_t p = 0x5555555555555555ull; // pattern of alternating 0 and 1
  uint64_t c = 17316035218449499591ull;// random uneven integer constant; 
  return c*xorshift(p*xorshift(n,32),32);
}

to be good enough.

A good hash function should

  1. be bijective to not loose information, if possible and have the least collisions
  2. cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5.

Let's first look at the identity function. It satisfies 1. but not 2. :

identity function

Input bit n determines output bit n with a correlation of 100% (red) and no others, they are therefore blue, giving a perfect red line across.

A xorshift(n,32) is not much better, yielding one and half a line. Still satisfying 1., because it is invertible with a second application.

xorshift

A multiplication with an unsigned integer ("Knuth's multiplicative method") is much better, cascading more strongly and flipping more output bits with a probability of 0.5, which is what you want, in green. It satisfies 1. as for each uneven integer there is a multiplicative inverse.

knuth

Combining the two gives the following output, still satisfying 1. as the composition of two bijective functions yields another bijective function.

knuth•xorshift

A second application of multiplication and xorshift will yield the following:

proposed hash

Or you can use Galois field multiplications like GHash, they have become reasonably fast on modern CPUs and have superior qualities in one step.

   uint64_t const inline gfmul(const uint64_t& i,const uint64_t& j){           
     __m128i I{};I[0]^=i;                                                          
     __m128i J{};J[0]^=j;                                                          
     __m128i M{};M[0]^=0xb000000000000000ull;                                      
     __m128i X = _mm_clmulepi64_si128(I,J,0);                                      
     __m128i A = _mm_clmulepi64_si128(X,M,0);                                      
     __m128i B = _mm_clmulepi64_si128(A,M,0);                                      
     return A[0]^A[1]^B[1]^X[0]^X[1];                                              
   }
7
votes
  • 32-bits multiplicative method (very fast) see @rafal

    #define hash32(x) ((x)*2654435761)
    #define H_BITS 24 // Hashtable size
    #define H_SHIFT (32-H_BITS)
    unsigned hashtab[1<<H_BITS]  
    .... 
    unsigned slot = hash32(x) >> H_SHIFT
    
  • 32-bits and 64-bits (good distribution) at : MurmurHash

  • Integer Hash Function
6
votes

This page lists some simple hash functions that tend to decently in general, but any simple hash has pathological cases where it doesn't work well.

3
votes

There's a nice overview over some hash algorithms at Eternally Confuzzled. I'd recommend Bob Jenkins' one-at-a-time hash which quickly reaches avalanche and therefore can be used for efficient hash table lookup.

2
votes

The answer depends on a lot of things like:

  • Where do you intend to employ it?
  • What are you trying to do with the hash?
  • Do you need a crytographically secure hash function?

I suggest that you take a look at the Merkle-Damgard family of hash functions like SHA-1 etc

1
votes

I don't think we can say that a hash function is "good" without knowing your data in advance ! and without knowing what you're going to do with it.

There are better data structures than hash tables for unknown data sizes (I'm assuming you're doing the hashing for a hash table here ). I would personally use a hash table when I Know I have a "finite" number of elements that are needing stored in a limited amount of memory. I would try and do a quick statistical analysis on my data, see how it is distributed etc before I start thinking about my hash function.

1
votes

For random hash values, some engineers said golden ratio prime number(2654435761) is a bad choice, with my testing results, I found that it's not true; instead, 2654435761 distributes the hash values pretty good.

#define MCR_HashTableSize 2^10

unsigned int
Hash_UInt_GRPrimeNumber(unsigned int key)
{
  key = key*2654435761 & (MCR_HashTableSize - 1)
  return key;
}

The hash table size must be a power of two.

I have written a test program to evaluate many hash functions for integers, the results show that GRPrimeNumber is a pretty good choice.

I have tried:

  1. total_data_entry_number / total_bucket_number = 2, 3, 4; where total_bucket_number = hash table size;
  2. map hash value domain into bucket index domain; that is, convert hash value into bucket index by Logical And Operation with (hash_table_size - 1), as shown in Hash_UInt_GRPrimeNumber();
  3. calculate the collision number of each bucket;
  4. record the bucket that has not been mapped, that is, an empty bucket;
  5. find out the max collision number of all buckets; that is, the longest chain length;

With my testing results, I found that Golden Ratio Prime Number always has the fewer empty buckets or zero empty bucket and the shortest collision chain length.

Some hash functions for integers are claimed to be good, but the testing results show that when the total_data_entry / total_bucket_number = 3, the longest chain length is bigger than 10(max collision number > 10), and many buckets are not mapped(empty buckets), which is very bad, compared with the result of zero empty bucket and longest chain length 3 by Golden Ratio Prime Number Hashing.

BTW, with my testing results, I found one version of shifting-xor hash functions is pretty good(It's shared by mikera).

unsigned int Hash_UInt_M3(unsigned int key)
{
  key ^= (key << 13);
  key ^= (key >> 17);    
  key ^= (key << 5); 
  return key;
}
1
votes

I have been using splitmix64 (pointed in Thomas Mueller's answer) ever since I found this thread. However, I recently stumbled upon Pelle Evensen's rrxmrrxmsx_0, which yielded tremendously better statistical distribution than the original MurmurHash3 finalizer and its successors (splitmix64 and other mixes). Here is the code snippet in C:

#include <stdint.h>

static inline uint64_t ror64(uint64_t v, int r) {
    return (v >> r) | (v << (64 - r));
}

uint64_t rrxmrrxmsx_0(uint64_t v) {
    v ^= ror64(v, 25) ^ ror64(v, 50);
    v *= 0xA24BAED4963EE407UL;
    v ^= ror64(v, 24) ^ ror64(v, 49);
    v *= 0x9FB21C651E98DF25UL;
    return v ^ v >> 28;
}

Pelle also provides an in-depth analysis of the 64-bit mixer used in the final step of MurmurHash3 and the more recent variants.