5
votes

I started reading about implementing various data structures a few days back, and got around to hash tables and got stuck on a specific point.

My understanding of how a hash table is implemented: a key K is passed to a hash function H, which returns a hashed version of K, call it HK. HK should probably be at least a uint32_t to keep collisions rare. We have an array of size X, and the item is stored at index HK of this array. But wouldn't this require a pre-allocated array whose length covers the full range of uint32_t (or whatever the return type of H is)? Assuming we don't store the data itself within that array and instead store a pointer to the data, we'd need an array of 2^32 pointers. This seems quite wasteful: on 64-bit that would mean a memory usage of 2^32 * 8 = 34359738368 bytes, or ~32 GB, just for the array of pointers to the data, which is obviously not how it's actually implemented in real life.

So what am I missing?

4
I think the typical implementation is not using an array but a linked list. – Stephan Dollberg
I think the typical implementation is not using a linked list but an array. – max taldykin
About collisions: when using a hashtable, there will be collisions. They should be handled, not avoided. They can be minimized by decent hashing and dimensioning. – stefaanv
@bamboon: actually, you use an array of linked lists for the run-of-the-mill implementation. It's not ideal in case of collision and has poor locality of reference (between items), but the complexity of operations is quite predictable. – Matthieu M.
@MatthieuM. Well yeah, that was a misunderstanding; I thought he was talking about the second stage. – Stephan Dollberg

4 Answers

4
votes

It depends upon the implementation. There are three basic ways this is done:

1) Small hashes are used. So instead of using a 32-bit hash, say, an 8-bit hash is used.

2) Multiple levels of hashing are used. So, for example, a 12-bit hash may determine which "bucket" an entry goes in, but a collision only occurs if the full 32-bit hash matches. Each bucket is stored in a linked list or similar structure. (Perhaps one optimized for searching for the full 32-bit hash within it.)

3) Sparse arrays are used. These are data structures that don't need to actually store blank spaces for unfilled slots. (In practice, it could be something entirely different such as a tree, but it acts like a sparse array with efficient searching.)
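As a rough sketch of approach 2 (all names and types here are my own, not from the answer): a few low bits of the full 32-bit hash pick the bucket, the full hash is stored alongside each entry, and a collision only matters if the full hash and key both match.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>
#include <vector>

// Illustrative sketch of multi-level hashing: bucket chosen by a reduced
// hash, full 32-bit hash kept per entry to cheapen comparisons in the chain.
struct Entry {
    uint32_t full_hash;  // the complete hash, not just the bucket bits
    std::string key;
    int value;
};

class BucketedTable {
public:
    explicit BucketedTable(std::size_t nbuckets) : buckets_(nbuckets) {}

    void insert(uint32_t full_hash, const std::string& key, int value) {
        auto& bucket = buckets_[full_hash % buckets_.size()];
        for (auto& e : bucket)
            if (e.full_hash == full_hash && e.key == key) { e.value = value; return; }
        bucket.push_back({full_hash, key, value});
    }

    const int* find(uint32_t full_hash, const std::string& key) const {
        const auto& bucket = buckets_[full_hash % buckets_.size()];
        for (const auto& e : bucket)
            if (e.full_hash == full_hash && e.key == key) return &e.value;
        return nullptr;  // not stored
    }

private:
    std::vector<std::list<Entry>> buckets_;  // each bucket is a chain
};
```

Two keys landing in the same bucket only slow down lookup within that one chain; the array itself stays small.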

2
votes

You should construct your hash table so that it can be extended. There are several methods for doing that; read this, it will be helpful. In that example a linked list is used. You also need to extend your table once there are no empty slots left. That creates the following problem: if you extend your table, the bucket index your H function yields can change for old keys K. So you must think about how to solve this issue. One method is to reload (rehash) all stored values whenever the table is extended; that's fine as long as you don't extend it too often.
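The "reload all values" step described above can be sketched like this (a simplified toy, not the linked answer's code: the key doubles as its own hash, and each bucket is a plain vector):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// When the table grows, every stored key must be re-bucketed, because
// hk % new_size generally differs from hk % old_size.
using Table = std::vector<std::vector<unsigned>>;

Table rehash(const Table& old_table, std::size_t new_size) {
    Table fresh(new_size);
    for (const auto& bucket : old_table)
        for (unsigned hk : bucket)
            fresh[hk % new_size].push_back(hk);  // old index is no longer valid
    return fresh;
}
```

Note how two keys that collided at the old size can end up in distinct buckets at the new size, which is exactly why growing helps.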

2
votes

Realistically, you have an array of some smaller, fixed number of buckets, which on collisions uses either chaining (each slot holds a linked list) or probing (simplest example: if the slot for hash(x) is taken, try hash(x)+1). In the simplest case, you take your uint32 and reduce it modulo the bucket count.
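The probing variant mentioned above ("if hash(x) is taken, try hash(x)+1") can be sketched minimally like so; the slot count and types are arbitrary choices for illustration, and deletion is omitted to keep it short:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal open-addressing (linear probing) sketch.
// Here the key doubles as its own hash value.
class ProbingTable {
public:
    explicit ProbingTable(std::size_t slots) : keys_(slots), used_(slots, false) {}

    bool insert(uint32_t key) {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t slot = (key + i) % keys_.size();  // hash(x), hash(x)+1, ...
            if (!used_[slot]) { keys_[slot] = key; used_[slot] = true; return true; }
            if (keys_[slot] == key) return true;  // already present
        }
        return false;  // table full
    }

    bool contains(uint32_t key) const {
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t slot = (key + i) % keys_.size();
            if (!used_[slot]) return false;  // empty slot ends the probe chain
            if (keys_[slot] == key) return true;
        }
        return false;
    }

private:
    std::vector<uint32_t> keys_;
    std::vector<bool> used_;
};
```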

You can define a load factor: once you get to N% of the array being full, say somewhere between 50% and 75%, you (for example) double the size of the array and rehash everything into the new array.

Well, isn't that expensive, you say? Not really. Say you double the size of the array each time. You add N items, the last of which triggers a copy: N adds at O(1), then one O(N) copy. But O(N) spread over N adds averages out to O(1), so the amortized cost of adding is still O(1), assuming your load factor is chosen wisely.
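The amortized argument can be checked with a quick count (a toy model of my own, not a real table): doubling from capacity 1, the copies triggered over N inserts total 1 + 2 + 4 + ... < N, i.e. less than one copy per insert on average.

```cpp
#include <cassert>
#include <cstddef>

// Count how many element copies happen when the backing array doubles
// each time it fills up.
std::size_t copies_for(std::size_t n_inserts) {
    std::size_t capacity = 1, size = 0, copies = 0;
    for (std::size_t i = 0; i < n_inserts; ++i) {
        if (size == capacity) {       // full: grow and copy everything over
            copies += size;
            capacity *= 2;
        }
        ++size;
    }
    return copies;
}
```

For 1024 inserts this gives 1 + 2 + 4 + ... + 512 = 1023 copies total, under one copy per insert.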

1
votes

The typical implementation of hash tables is an array of linked lists. The linked list can easily be swapped for another data structure, so we will call it a Bucket from now on.

The idea is simple:

class HashTable {
public:
  // insert/find/erase go here; each starts by reducing
  // the key's hash HK to a bucket index

private:
  std::vector<Bucket*> _array;
};

Then, you take HK and reduce it to fit in the array, usually with a modulo: HK % size(_array), which gives the index of the bucket to be used.
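That reduction step can be written as a one-liner (the function name and the example sizes are mine, purely for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fold a full 32-bit hash HK into a valid index for an array of
// `array_size` buckets: the "HK % size(_array)" step from the answer.
std::size_t bucket_index(uint32_t hk, std::size_t array_size) {
    return hk % array_size;
}
```

This is what lets the array stay small: any 32-bit hash maps into the few buckets you actually allocated, and two hashes that agree modulo the size simply share a bucket.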