2
votes

I am trying to set up a hash table (in C++, using the unordered_map container) which is to contain 1875 integer items that are randomly distributed within the interval 0 to 4891. My problem is that the distribution within this interval is not uniform but rather looks like this:

[scatter plot of the 1875 values omitted]

where each of the 1875 random integers is plotted as a point with x corresponding to the integer value and y = 1 (so as to visualize the distribution).

Clearly the distribution is such that there are wide gaps in which no random integers lie. If I use the identity function as my hash function (i.e. use the random integers themselves as hash values), I get 714 empty buckets, 814 buckets with a single element, 499 buckets with 2 elements and 21 buckets with 3 or more elements.

I'm using the Intel C++ compiler, and its unordered_map implementation uses powers of 2 for the number of buckets in the hash table. Right now my hash table has 2^11 = 2048 buckets.

What would be a good hash function for this case? My understanding is that a good hash function here would break up these clusters and spread the values more uniformly over the buckets, but how could one achieve that?

So you only have 1800 integers? How about a sorted std::vector of 1800 key-value pairs and then binary searching through it? This will at least be worth measuring. – Baum mit Augen
Boost's flat_map (overview) (API docs) is a nice, full implementation of what @BaummitAugen suggests: its interface is like an unordered_map, but it's implemented like an ordered vector. – Mark Waterman
@user3208430, I'm coming around: here's a benchmark of the various associative containers, and unordered_map did very well compared to a flat_map for random searches (though a flat_map rocks for iteration). – Mark Waterman
The distribution of your bucket loads is actually better than that of any standard hash function. An option could be an unsigned multiply by a (large) odd number (see the sketch after these comments). – wildplasser
How about having 4891 buckets and using the identity function as the hash function? It could take less space, since you need no pointers and overflow chains and all that stuff (it is called indexing). And a 30% load factor is not so weird if it guarantees you perfect hashing... – wildplasser
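
A minimal sketch of wildplasser's multiply-by-a-large-odd-number idea might look like the following; the specific constant (2^64 divided by the golden ratio, a common choice) and the xor-fold are arbitrary choices for illustration, not anything specified in the comments:

    #include <cstddef>
    #include <cstdint>

    // Multiplicative hash: multiply by a large odd constant to spread nearby
    // keys across the table. The xor-fold mixes the high bits down, which
    // matters because a power-of-two bucket count keeps only the low bits.
    struct MultiplicativeHash {
        std::size_t operator()(int key) const {
            std::uint64_t x = static_cast<std::uint64_t>(key) * 0x9E3779B97F4A7C15ull;
            return static_cast<std::size_t>(x ^ (x >> 32));
        }
    };

    // Usage: std::unordered_map<int, int, MultiplicativeHash> table;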

3 Answers

1
votes

I've found that Pearson's Hash Function is an excellent way to get randomness:

https://en.wikipedia.org/wiki/Pearson_hashing

Basically, the idea is that the key is run through a table of 256 randomly permuted values (an array of the numbers 0 to 255, shuffled), which scrambles the bits very effectively; by default the output covers 256 bins, but you might be able to adapt it to the roughly 1800 values in your scenario. The important thing is that the table is small enough to fit comfortably in memory.
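
A minimal sketch of what a Pearson-style hash functor could look like for the integer keys in the question (the permutation table and the fixed seed below are arbitrary illustrative choices):

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>

    // Pearson-style hash: the 256-entry table is a random permutation of
    // 0..255; each byte of the key is fed through it in turn.
    struct PearsonHash {
        std::array<std::uint8_t, 256> table{};

        PearsonHash() {
            std::iota(table.begin(), table.end(), std::uint8_t{0});
            std::mt19937 rng(12345);                 // fixed seed keeps the hash reproducible
            std::shuffle(table.begin(), table.end(), rng);
        }

        std::size_t operator()(int key) const {
            std::uint8_t h = 0;
            for (int i = 0; i < 4; ++i)              // feed the key one byte at a time
                h = table[h ^ static_cast<std::uint8_t>((key >> (8 * i)) & 0xFF)];
            return h;                                // 8-bit result: at most 256 distinct values
        }
    };

    // Usage: std::unordered_map<int, int, PearsonHash> table;

Note that a plain 8-bit Pearson hash can only produce 256 distinct values, which is fewer than the 2048 buckets in the question, so it would need to be extended (e.g. by combining several passes, as described in the Wikipedia article) to fill a larger table.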

1
votes

If you need to reduce the number of collisions, it may help to look at a specialized hashing scheme like cuckoo hashing. Essentially it spreads each key across multiple hash functions: every key has a small fixed set of candidate buckets, so lookups stay O(1) in the worst case while insertions remain amortized O(1).
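
For illustration, here is a minimal sketch of a cuckoo hash set for int keys, with two tables and two hash functions; the table size, multiplier constants, and eviction bound are arbitrary choices, and a real implementation would also rehash when an insertion fails:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    class CuckooSet {
        static constexpr std::size_t kBuckets = 2048;
        std::vector<std::optional<int>> t1, t2;

        static std::size_t h1(int k) {
            return static_cast<std::size_t>(static_cast<std::uint64_t>(k) * 0x9E3779B97F4A7C15ull) % kBuckets;
        }
        static std::size_t h2(int k) {
            return static_cast<std::size_t>(static_cast<std::uint64_t>(k) * 0xC2B2AE3D27D4EB4Full) % kBuckets;
        }

    public:
        CuckooSet() : t1(kBuckets), t2(kBuckets) {}

        // Worst-case O(1): a key can only ever live in one of two slots.
        bool contains(int k) const {
            return (t1[h1(k)] && *t1[h1(k)] == k) || (t2[h2(k)] && *t2[h2(k)] == k);
        }

        bool insert(int k) {
            if (contains(k)) return false;
            for (int kicks = 0; kicks < 64; ++kicks) {   // bound the displacement chain
                auto& s1 = t1[h1(k)];
                if (!s1) { s1 = k; return true; }
                std::swap(k, *s1);                       // evict and try to re-place the old key
                auto& s2 = t2[h2(k)];
                if (!s2) { s2 = k; return true; }
                std::swap(k, *s2);
            }
            return false;   // a real implementation would rehash with new functions here
        }
    };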

If the collisions are inexpensive, though (e.g. the colliding elements fit on a cache line, or the access pattern is predictable), you'll probably still see good performance in practice, regardless of the asymptotic cost of handling them.

Flat structures tend to be used for this reason, since they have good cache characteristics. It's also one of the reasons they tend to be preferred when performance is important.

0
votes

So I spent some time trying different things out. Here are my conclusions so far.

First, one has to realize that fitting 1875 elements into a hash table with 2048 buckets is likely to produce quite a few collisions. Indeed, if each element had an equal probability of being assigned to any one of the 2048 buckets, the expected number of collisions would be about 646. This follows from an argument similar to the so-called birthday problem (see https://math.stackexchange.com/questions/35791/birthday-problem-expected-number-of-collisions?rq=1): the expected number of collisions is n – N * (1 – (1 – 1/N)^n), where n is the number of elements and N is the number of buckets. This would be the case if, for example, the 1875 elements were chosen randomly within the interval [0, 2047] with repetitions allowed, or if they were chosen randomly, with or without repetitions, in an interval that is very large compared to the 2048 buckets.
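
For reference, a few lines of C++ reproduce that number from the formula above:

    #include <cmath>
    #include <cstdio>

    // Expected collisions = n - N * (1 - (1 - 1/N)^n)
    int main() {
        const double n = 1875.0;   // number of elements
        const double N = 2048.0;   // number of buckets
        const double expected = n - N * (1.0 - std::pow(1.0 - 1.0 / N, n));
        std::printf("expected collisions: %.1f\n", expected);   // prints ~646.6
        return 0;
    }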

Bearing this in mind, the 541 collisions obtained with the identity function as the hash function (see the original question) do not seem too bad. The reason the number of collisions is smaller than in the uniform case, despite those large gaps in the distribution, is that by the nature of the problem the 1875 elements have distinct values, so only elements of 2048 or more can cause collisions once they are wrapped around by the modulo operation.

Now, we know that a hash function that maps our input interval [0, 4891] randomly and uniformly onto a much larger interval (a 32-bit integer, for example) is not desirable, since it will cause more collisions than the identity hash function. However, one could wonder whether a random and uniform mapping from the input interval [0, 4891] to some not overly large interval (it could be the same [0, 4891] interval, or any other interval like [0, 2048], [0, 5000], etc.) could reduce collisions. I have tried Pearson-like mappings as suggested by rts1 but found that they don't improve the number of collisions.
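
A small harness like the following can be used to compare candidate hash functions the way this answer does, counting a collision for every element beyond the first in a bucket; the keys vector stands in for the 1875 integers from the question, which are not reproduced here:

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // Counts collisions as (number of keys) - (number of occupied buckets),
    // assuming the table maps each key to hash(key) % bucket_count (with a
    // power-of-two bucket count this is the same as masking the low bits).
    template <typename Hash>
    std::size_t count_collisions(const std::vector<int>& keys,
                                 std::size_t bucket_count,
                                 Hash hash) {
        std::unordered_set<std::size_t> occupied;
        for (int k : keys)
            occupied.insert(hash(k) % bucket_count);
        return keys.size() - occupied.size();
    }

    // Example, comparing the identity hash with another functor (e.g. the
    // PearsonHash sketch given in the earlier answer):
    // std::size_t c1 = count_collisions(keys, 2048, [](int k) { return std::size_t(k); });
    // std::size_t c2 = count_collisions(keys, 2048, PearsonHash{});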

So far, what I've settled on is simply using the identity function as the hash function, combined with making sure that the number of elements is not too close to the number of buckets (about 1.5 times the number of elements seems pretty reasonable for the bucket count).
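
As a minimal sketch, the set-up described above might look like this (the mapped type and the exact 1.5x factor are only illustrative):

    #include <cstddef>
    #include <unordered_map>

    // Identity hash: use the integer key itself as the hash value.
    struct IdentityHash {
        std::size_t operator()(int key) const { return static_cast<std::size_t>(key); }
    };

    int main() {
        std::unordered_map<int, int, IdentityHash> table;
        const std::size_t n_elements = 1875;
        table.rehash(n_elements * 3 / 2);   // request at least ~2812 buckets; an
                                            // implementation that uses powers of two
                                            // will round this up to 4096
        // ... insert the 1875 keys here ...
        return 0;
    }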