Q: In reality, if the load factor is less than 1, why would we bother with open addressing? Why not project each key to an integer and arrange the keys in an array?
Because in many situations where hash tables are used, there's no good O(1) way to "project each key to a [distinct, not-absurdly-sparse] integer" array index.
A simple thought experiment illustrates this: say you expect the user to type four three-uppercase-letter keys, and you want to store them somewhere in an array with dimension 10. There are 26³ = 17,576 possible inputs, so no matter what your logic is, on average 26³/10 ≈ 1,758 of them will "project... to an integer" indicating the same array position. Once you realise the "project[ion]" can't avoid potential "collisions", that projecting is a logically identical operation to "hashing" and modding to a "bucket", and that some collision-handling logic will therefore be needed, your proposed "alternative" morphs back into a hash table....
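To make that pigeonhole argument concrete, here's a minimal sketch (Python, purely illustrative; the base-26 "projection" is just one arbitrary choice) that maps every three-uppercase-letter key into a 10-slot array and counts how many keys land on each index:

    from collections import Counter
    from itertools import product
    import string

    ARRAY_SIZE = 10  # the proposed array's dimension

    def project(key: str) -> int:
        """An arbitrary deterministic 'projection' of a key to an index:
        read the three letters as a base-26 number, then take mod 10.
        This is exactly 'hashing' followed by bucketing, in disguise."""
        value = 0
        for ch in key:
            value = value * 26 + (ord(ch) - ord('A'))
        return value % ARRAY_SIZE

    keys = (''.join(p) for p in product(string.ascii_uppercase, repeat=3))
    counts = Counter(project(k) for k in keys)
    print(counts)  # every index receives roughly 26**3 / 10 ~= 1758 keys

Any other deterministic projection just redistributes which ~1,758 keys share each slot; it can't eliminate the sharing.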
Q: Linear probing and quadratic probing can only generate m possible probe sequences, assuming m is the hash table length. However, as defined in open addressing, the possible key value number is greater than the number of hash values, i.e. load factor n/m < 1.
Those are very confusing statements. The "number of hash values" is not arbitrarily limited - you could use a 32-bit hash generating any of ~4 billion hash values, a 512-bit hash, or whatever other size you feel like. Given the structure of your statement is "a > b, i.e. load factor n/m < 1", and "n/m < 1" can be rewritten as "n < m" or "m > n", you imply "a" and "m" are meant to be the same thing, as are "b" and "n":

- you're referring to m - which "load factor n/m" requires be the number of buckets in the hash table - as "the possible key value number": it's not, and what could that even mean?
- you're referring to n - which "load factor n/m" requires be the number of keys stored in the hash table - as "the number of hash values": it's not, except in the trivial sense of that many (not necessarily distinct) hash values being generated when the keys are hashed.
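That said, the first quoted sentence does contain a standard fact worth illustrating: with linear probing, the entire probe sequence is determined by the initial bucket h(k) mod m, so at most m distinct sequences exist. A minimal sketch (Python; the table size and keys are arbitrary choices of mine):

    TABLE_SIZE = 11  # m: an arbitrary small table size

    def linear_probe_sequence(key: int) -> list[int]:
        """Linear probing: the i-th probe is (h(key) + i) mod m, so the
        whole sequence is fixed once the starting bucket is known --
        hence at most m distinct probe sequences in total."""
        start = hash(key) % TABLE_SIZE
        return [(start + i) % TABLE_SIZE for i in range(TABLE_SIZE)]

    # Distinct keys whose hashes agree mod m share one identical sequence
    # (CPython's hash of a small int is the int itself):
    print(linear_probe_sequence(3))               # [3, 4, 5, ..., 2]
    print(linear_probe_sequence(3 + TABLE_SIZE))  # same sequence again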
Q: In reality, if the hash function is predefined, there exist only n possible probe sequences, which is less than m.
Again, that's a very poorly defined statement. The hashing of n keys can identify at most n distinct buckets from which collision-handling would kick in, but those n could begin pretty much anywhere within the m buckets, given the hash function's job is to spray them around. And, so what?
Q: The same thing applies to double hashing. If the book says one hash function is randomly chosen from a set of universal hash functions, then I can understand.
Understand what?
Q: Without introducing randomness in the open addressing analysis, the analysis of its performance based on universal hashing is obscured.
For sure. "Repeatable randomness" of hashing is a very convenient and tangible benchmark against which specific implementations can be compared.
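For reference, "randomly chosen from a set of universal hash functions" can be as simple as drawing the parameters of the classic Carter-Wegman family h(k) = ((a*k + b) mod p) mod m. A minimal sketch (Python; the prime p and the function name are my own choices):

    import random

    def make_universal_hash(m: int, p: int = 2**31 - 1):
        """Draw h(k) = ((a*k + b) mod p) mod m uniformly at random from
        the Carter-Wegman universal family; p must be a prime larger
        than any key, and a must be nonzero."""
        a = random.randrange(1, p)
        b = random.randrange(0, p)
        return lambda k: ((a * k + b) % p) % m

    h = make_universal_hash(m=16)     # one random member of the family
    print([h(k) for k in range(10)])  # bucket indices in [0, 16)

Re-running the program draws a different member of the family, which is exactly the "repeatable randomness" the average-case analysis assumes.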
Q: I have never used a hash table in practice; maybe I dive too deep into the details. But I also have the following doubt about hash tables' practical usage: