5
votes

I would like to construct a hash table that looks up keys in sequences (strings) of bytes ranging from 1 to 15 bytes.

I would like to store an integer value, so I imagine an array for hashing would suffice. I'm having difficulty conceptualizing how to construct a hash function such that given the key would give an index into the array.

Any assistance would be much appreiated.

The maximum number of entries in the hash is: 4081*15 + 4081*14 + ... 4081 = 4081((15*(16))/2) = 489720.

So for example:

int table[489720];

int lookup(unsigned char *key)
{
    int index = hash(key);
    return table[index];
}

What are some good choices for a hash function, or how would I go about constructing one?

Thanks.

4
If two keys map to the same index, you have a collision, which is not correctly handled in your example. Did you keep your example this simply just to illustrate your hashing, or do you really need an additional explanation about hashing tables itself also? (open hashing, closed hashing, ...)Patrick

4 Answers

3
votes

To hash C strings, I've always used this function (take the result % your hash table's size):

int hashstring(const char* s) {
  int key = 0;
  while (*s) {
    key = key*37 + *s++;
  }
  return key;
}

I don't remember where I got it from initially, but in many years it hasn't let me down.

2
votes

Your key space is large (approx 2^(8*15)), so if you want a perfect hash, you will need to know what 489720 actual keys will show up in advance. Even then, it is practically impossible to find a perfect hash for those keys, even if you allowed a much larger table (a.k.a. a very low load factor). The only way I know to find a perfect hash is by trial and error, and a random hash is likely to fail unless your table has close to 489720^2 entries.

I highly recommend using a regular (non-perfect) hash and deal with collisions appropriately, e.g. with chaining:

struct entry {
  unsigned char *key;
  int value;
  struct entry *next;
} *table[1<<20];
int lookup(unsigned char *key) {
  int index = hash(key) % (1<<20);
  for (struct entry *e = table[index]; e != NULL; e = e->next) {
    if (!strcmp(key, e->key)) return e->value;
  }
  // not found
}

I also recommend you don't implement this yourself - use a standard library like a c++ hashmap.

0
votes

If you want a perfect hash, then you can start by reading the Wikipedia article on perfect hashing. If you run into snags , you can ask for help here.

0
votes

If the the average number of strings resident in the table is low--like under 10,000 entries--an associative array would be a reasonable approach, even using a linear search if it's on a modern CPU architecture.

Otherwise, constructing a "perfect hash" requires inspecting each character of the string and computing a unique value based on the possible range. For example, if only the 26 characters A..Z are allowed in the key, this would work:

int
hash (const char *key)
{
   int h = 0;
   while (key && *key)
       h = h * 26 + (*key++ - 'A');
   return h;
}