27
votes

Hash Tables are said to be the fastest/best way of Storing/Retrieving data.

My understanding of a hash table, hashing is as follows (Please correct me if I am wrong or Please add If there is anything more):

  • A Hash Table is nothing but an array (single or multi-dimensional) to store values.
  • Hashing is the process to find the index/location in the array to insert/retrieve the data. You take a data item(s) and pass it as a key(s) to a hash function and you would get the index/location where to insert/retrieve the data.

I have a question:

Is the hash function used to store/retrieve the data DIFFERENT from a cryptographic hash function used in security applications for authentication like MD5, HMAC, SHA-1 etc...?

In what way(s) are they different?

  • How to write a hash function in C?
  • Is there some standard or guidelines to it?
  • How do we ensure that the output of a hash function i.e, index is not out of range?

It would be great if you could mention some good links to understand these better.

3
The range can be limited with the module (%) operator.tur1ng
The following page has several implementations of general purpose hash functions implemented in C (and many other languages): partow.net/programming/hashfunctions/index.htmlMatthieu N.

3 Answers

11
votes

A cryptographic hash emphasizes making it difficult for anybody to intentionally create a collision. For a hash table, the emphasis is normally on producing a reasonable spread of results quickly. As such, the two are usually quite different (in particular, a cryptographic hash is normally a lot slower).

For a typical hash function, the result is limited only by the type -- e.g. if it returns a size_t, it's perfectly fine for it to return any possible size_t. It's up to you to reduce that output range to the size of your table (e.g. using the remainder of dividing by the size of your table, which should often be a prime number).

As an example, a fairly typical normal hash function might look something like:

// warning: untested code.
size_t hash(char const *input) { 

    const int ret_size = 32;
    size_t ret = 0x555555;
    const int per_char = 7;

    while (*input) { 
        ret ^= *input++;
        ret = ((ret << per_char) | (ret >> (ret_size - per_char));
   }
   return ret;
}

The basic idea here is to have every bit of the input string affect the result, and to (as quickly as possible) have every bit of the result affected by at least part of the input. Note that I'm not particularly recommending this as a great hash function -- only trying to illustrate some of the basics of what you're trying to accomplish.

4
votes

Bob Jenkins wrote an in-depth description of his good, if slightly outdated, hash function. The article has links to newer, better hash functions, but the writeup addresses the concerns of building a good one.

Also, most hash table implementations actually use an array of linked lists to resolve collisions. If you want to just use an array then the hash function needs to check for collisions and create a new hash index.

The cryptographic hash functions you mention could be used as hash functions for a hash table, but they are much slower than hash functions designed for a hash table. Speed makes brute force attacks easier.

0
votes

The design goals are different.

With cryptographic hash functions you want, for example, that the hash and the hash function cannot be used to determine the original data or any other data that would produce the same hash.

Hash functions used with hash tables & other data structures do not need such security properties. It's often enough if the hash function is fast and it will distribute the input set evenly into the set of possible hashes (to avoid unnecessary clustering / collisions).