2
votes

I am trying to set up a hash table (in C++, using the unordered_map container) which is to contain 1875 integer items that are randomly distributed within the interval 0 to 4891. My problem is that the distribution within this interval is not uniform but rather looks like this:

[scatter plot of the 1875 values omitted]

where each of the 1875 random integers is plotted as a point with x corresponding to the integer value and y = 1 (so as to visualize the distribution).

Clearly the distribution is such that there are wide gaps in which no random integers lie. If I use the identity function as my hash function (i.e. use the random integers themselves as hash values), I get 714 empty buckets, 814 buckets with a single element, 499 buckets with 2 elements and 21 buckets with 3 or more elements.

I'm using the Intel C++ compiler, and its unordered_map implementation uses powers of 2 for the number of buckets in the hash table. Right now my hash table has 2^11 = 2048 buckets.

What would be a good hash function for this case? My understanding is that a good hash function here would break up these clusters and spread the values more uniformly over the buckets, but how could one achieve that?

So you only have 1800 integers? How about a sorted std::vector of 1800 key-value pairs and then binary searching through it? This will at least be worth measuring. – Baum mit Augen
Boost's flat_map (overview) (API docs) is a nice, full implementation of what @BaummitAugen suggests: its interface is like an unordered_map, but it's implemented like an ordered vector. – Mark Waterman
@user3208430, I'm coming around: here's a benchmark of the various associative containers, and unordered_map did very well compared to a flat_map for random searches (though a flat_map rocks for iteration). – Mark Waterman
The distribution of your bucket loads is actually better than that of any standard hash function. An option could be an unsigned multiply by a (large) odd number (see the sketch after these comments). – wildplasser
How about having 4891 buckets and using the identity function as the hash function? It could take less space, since you need no pointers and overflow chains and all that stuff (it is called indexing). And a 30% load factor is not so weird if it guarantees you perfect hashing... – wildplasser
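
A minimal sketch of wildplasser's multiply-by-a-large-odd-number idea might look like the following; the specific constant (2^64 divided by the golden ratio, a common choice) and the xor-fold are arbitrary choices for illustration, not anything specified in the comments:

    #include <cstddef>
    #include <cstdint>

    // Multiplicative hash: multiply by a large odd constant to spread nearby
    // keys across the table. The xor-fold mixes the high bits down, which
    // matters because a power-of-two bucket count keeps only the low bits.
    struct MultiplicativeHash {
        std::size_t operator()(int key) const {
            std::uint64_t x = static_cast<std::uint64_t>(key) * 0x9E3779B97F4A7C15ull;
            return static_cast<std::size_t>(x ^ (x >> 32));
        }
    };

    // Usage: std::unordered_map<int, int, MultiplicativeHash> table;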

3 Answers

1
votes

I've found that Pearson's Hash Function is an excellent way to get randomness:

https://en.wikipedia.org/wiki/Pearson_hashing

Basically, the idea is that the key is run through a table of 256 randomly permuted values (an array of the numbers 0 to 255, shuffled), which scrambles the bits very effectively; by default the output covers 256 bins, but you might be able to adapt it to the roughly 1800 values in your scenario. The important thing is that the table is small enough to fit comfortably in memory.
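
A minimal sketch of what a Pearson-style hash functor could look like for the integer keys in the question (the permutation table and the fixed seed below are arbitrary illustrative choices):

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>

    // Pearson-style hash: the 256-entry table is a random permutation of
    // 0..255; each byte of the key is fed through it in turn.
    struct PearsonHash {
        std::array<std::uint8_t, 256> table{};

        PearsonHash() {
            std::iota(table.begin(), table.end(), std::uint8_t{0});
            std::mt19937 rng(12345);                 // fixed seed keeps the hash reproducible
            std::shuffle(table.begin(), table.end(), rng);
        }

        std::size_t operator()(int key) const {
            std::uint8_t h = 0;
            for (int i = 0; i < 4; ++i)              // feed the key one byte at a time
                h = table[h ^ static_cast<std::uint8_t>((key >> (8 * i)) & 0xFF)];
            return h;                                // 8-bit result: at most 256 distinct values
        }
    };

    // Usage: std::unordered_map<int, int, PearsonHash> table;

Note that a plain 8-bit Pearson hash can only produce 256 distinct values, which is fewer than the 2048 buckets in the question, so it would need to be extended (e.g. by combining several passes, as described in the Wikipedia article) to fill a larger table.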

1
votes

If you need to reduce the number of collisions, it may help to look at a specialized hashing scheme like cuckoo hashing. Essentially it spreads each key across multiple hash functions: every key has a small fixed set of candidate buckets, so lookups stay O(1) in the worst case while insertions remain amortized O(1).
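
For illustration, here is a minimal sketch of a cuckoo hash set for int keys, with two tables and two hash functions; the table size, multiplier constants, and eviction bound are arbitrary choices, and a real implementation would also rehash when an insertion fails:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    class CuckooSet {
        static constexpr std::size_t kBuckets = 2048;
        std::vector<std::optional<int>> t1, t2;

        static std::size_t h1(int k) {
            return static_cast<std::size_t>(static_cast<std::uint64_t>(k) * 0x9E3779B97F4A7C15ull) % kBuckets;
        }
        static std::size_t h2(int k) {
            return static_cast<std::size_t>(static_cast<std::uint64_t>(k) * 0xC2B2AE3D27D4EB4Full) % kBuckets;
        }

    public:
        CuckooSet() : t1(kBuckets), t2(kBuckets) {}

        // Worst-case O(1): a key can only ever live in one of two slots.
        bool contains(int k) const {
            return (t1[h1(k)] && *t1[h1(k)] == k) || (t2[h2(k)] && *t2[h2(k)] == k);
        }

        bool insert(int k) {
            if (contains(k)) return false;
            for (int kicks = 0; kicks < 64; ++kicks) {   // bound the displacement chain
                auto& s1 = t1[h1(k)];
                if (!s1) { s1 = k; return true; }
                std::swap(k, *s1);                       // evict and try to re-place the old key
                auto& s2 = t2[h2(k)];
                if (!s2) { s2 = k; return true; }
                std::swap(k, *s2);
            }
            return false;   // a real implementation would rehash with new functions here
        }
    };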

If the collisions are inexpensive, though (e.g. the colliding elements fit on a cache line, or the access pattern is predictable), you'll probably still see good performance in practice, regardless of the asymptotic cost of handling them.

Flat structures tend to be used for this reason, since they have good cache characteristics. It's also one of the reasons they tend to be preferred when performance is important.

0
votes

So I spent some time trying different things out. Here are my conclusions so far.

First, one has to realize that fitting 1875 elements into a hash table with 2048 buckets is likely to produce quite a few collisions. Indeed, if each element had an equal probability of being assigned to any one of the 2048 buckets, the expected number of collisions would be about 646. This follows from an argument similar to the so-called birthday problem (see https://math.stackexchange.com/questions/35791/birthday-problem-expected-number-of-collisions?rq=1): the expected number of collisions is n – N * (1 – (1 – 1/N)^n), where n is the number of elements and N is the number of buckets. This would be the case if, for example, the 1875 elements were chosen randomly within the interval [0, 2047] with repetitions allowed, or if they were chosen randomly, with or without repetitions, in an interval that is very large compared to the 2048 buckets.
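
For reference, a few lines of C++ reproduce that number from the formula above:

    #include <cmath>
    #include <cstdio>

    // Expected collisions = n - N * (1 - (1 - 1/N)^n)
    int main() {
        const double n = 1875.0;   // number of elements
        const double N = 2048.0;   // number of buckets
        const double expected = n - N * (1.0 - std::pow(1.0 - 1.0 / N, n));
        std::printf("expected collisions: %.1f\n", expected);   // prints ~646.6
        return 0;
    }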

Bearing this in mind, the 541 collisions obtained with the identity function as the hash function (see the original question) do not seem too bad. The reason the number of collisions is smaller than in the uniform case, despite those large gaps in the distribution, is that by the nature of the problem the 1875 elements have distinct values, so only elements of 2048 or more can cause collisions once they are wrapped around by the modulo operation.

Now, we know that a hash function that maps our input interval [0, 4891] randomly and uniformly onto a much larger interval (a 32-bit integer, for example) is not desirable, since it will cause more collisions than the identity hash function. However, one could wonder whether a random and uniform mapping from the input interval [0, 4891] to some not overly large interval (it could be the same [0, 4891] interval, or any other interval like [0, 2048], [0, 5000], etc.) could reduce collisions. I have tried Pearson-like mappings as suggested by rts1 but found that they don't improve the number of collisions.
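
A small harness like the following can be used to compare candidate hash functions the way this answer does, counting a collision for every element beyond the first in a bucket; the keys vector stands in for the 1875 integers from the question, which are not reproduced here:

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // Counts collisions as (number of keys) - (number of occupied buckets),
    // assuming the table maps each key to hash(key) % bucket_count (with a
    // power-of-two bucket count this is the same as masking the low bits).
    template <typename Hash>
    std::size_t count_collisions(const std::vector<int>& keys,
                                 std::size_t bucket_count,
                                 Hash hash) {
        std::unordered_set<std::size_t> occupied;
        for (int k : keys)
            occupied.insert(hash(k) % bucket_count);
        return keys.size() - occupied.size();
    }

    // Example, comparing the identity hash with another functor (e.g. the
    // PearsonHash sketch given in the earlier answer):
    // std::size_t c1 = count_collisions(keys, 2048, [](int k) { return std::size_t(k); });
    // std::size_t c2 = count_collisions(keys, 2048, PearsonHash{});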

So far, what I've settled on is simply using the identity function as the hash function, combined with making sure that the number of elements is not too close to the number of buckets (about 1.5 times the number of elements seems pretty reasonable for the bucket count).
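
As a minimal sketch, the set-up described above might look like this (the mapped type and the exact 1.5x factor are only illustrative):

    #include <cstddef>
    #include <unordered_map>

    // Identity hash: use the integer key itself as the hash value.
    struct IdentityHash {
        std::size_t operator()(int key) const { return static_cast<std::size_t>(key); }
    };

    int main() {
        std::unordered_map<int, int, IdentityHash> table;
        const std::size_t n_elements = 1875;
        table.rehash(n_elements * 3 / 2);   // request at least ~2812 buckets; an
                                            // implementation that uses powers of two
                                            // will round this up to 4096
        // ... insert the 1875 keys here ...
        return 0;
    }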