4
votes

Suppose we know the keys and their distribution in advance and we want to build a fast lookup dictionary. We insert and never delete items.

For example, here are keys with their frequencies:

  • xyz 1000
  • abc 5
  • abd 20

A good hash would simply be some function of the first character that maps xyz to bucket 1 and abc, abd to bucket 2. xyz dominates the distribution, so we focus on that. Looking at only 1 character is faster than looking at all 3. Also, if during lookup the number of elements in a bucket is 1, then we know for sure the key we are looking up must be in that bucket. No need to compare xyz to xyz.
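A minimal sketch of that idea (the bucket assignment and key set are just the example from above; the "skip the comparison for singleton buckets" shortcut assumes lookups are only ever done with known keys):

```python
freqs = {"xyz": 1000, "abc": 5, "abd": 20}

def bucket(key):
    # Look at the first character only: 'x' -> bucket 0, else bucket 1
    return 0 if key[0] == "x" else 1

buckets = [[], []]
for k in freqs:
    buckets[bucket(k)].append(k)

def contains(key):
    b = buckets[bucket(key)]
    if len(b) == 1:
        # Singleton bucket: the key must be this one, no comparison needed
        # (only valid because we promise to look up known keys)
        return True
    return key in b
```

With this layout the dominant key xyz sits alone in bucket 0, so its lookup never touches the key bytes beyond the first character.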

Since we know the keys and distribution in advance, we could go for a perfect hash, but then the hash function could be slow.

I'm not looking for an optimal solution but a practical one.

1
In the simple hash function that you propose, there is no awareness of the distribution. So you could have only 1 item in the bucket with xyz, but you could also have 100 items in that bucket. So we have to think of a function that takes into consideration both the key and the distribution... – bogdan.herti
If there are only three values, a linear search (in the order xyz, abd, abc) will probably be faster than anything else. – wildplasser
@wildplasser He just gave an example, I think. – bogdan.herti

1 Answer

1
votes

Some considerations, and after that I will propose my solution:

  • If you implement a lookup table, you need two things: a hash function and a collision-resolution technique.

  • A hash function that is fast but not uniform will generally lead to slow collision resolution. In your case, knowing the distribution helps, as I explain below.

You could have "xyz" in a bucket together with another 100 keys that start with "x" and are poorly distributed. The worst case for GET here would be O(100), or generally O(d), where d is the size of the bucket, and this can be far from O(1).

My solution takes the distribution into consideration, not in the hash function but in the collision-resolution technique.

If you were thinking of using chaining (keys that hash to the same value are kept in a list, i.e. the bucket that you mentioned), you could order each list based on the distribution like this:

  • avoid the default of inserting at the beginning of the list, which gives INSERT in O(1) but GET in O(d), d being the size of the bucket;
  • instead, INSERT in decreasing order of frequency in O(d), and GET becomes close to O(1) for the high-frequency keys, because those keys are kept in the first positions of the list.

In this way:

  • you will have very fast GET operations (if GET operations are more frequent than INSERT operations, this can be a very good deal for you)
  • you can use a fast and simple hash function, such as the one you proposed based only on the first character
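The scheme above can be sketched as follows. The class name, the 4-bucket table size, and the first-character hash are illustrative assumptions, not a fixed design:

```python
NBUCKETS = 4  # illustrative table size

class FreqOrderedTable:
    """Chaining hash table whose chains are sorted by decreasing frequency."""

    def __init__(self):
        self.buckets = [[] for _ in range(NBUCKETS)]

    def _hash(self, key):
        # Fast hash: first character only, as proposed in the question
        return ord(key[0]) % NBUCKETS

    def insert(self, key, value, freq):
        # O(d): walk the chain to find the position that keeps it
        # sorted by decreasing frequency
        chain = self.buckets[self._hash(key)]
        i = 0
        while i < len(chain) and chain[i][2] >= freq:
            i += 1
        chain.insert(i, (key, value, freq))

    def get(self, key):
        # Close to O(1) for frequent keys: they sit near the chain's front
        for k, v, _ in self.buckets[self._hash(key)]:
            if k == key:
                return v
        raise KeyError(key)
```

For the example keys, "abd" (frequency 20) ends up ahead of "abc" (frequency 5) in their shared chain, so the more frequent key is found first.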