1
votes

This might be a silly question, but here goes :

I hashed a dictionary of words into an unordered_set based hash-table. My hash function was made intentionally "bad", in that all strings that contained the same set of letters would hash to the same value. I initially tried to over-ride the normal hash function behaviour, and use a "frequency histogram" of the letters in each word as a hash value (which I learnt was impossible :) ), but one of the threads suggested using a 26-bit bitmask to achieve the same. The hash function works fine and dandy thus far.

For example, in my scheme, CITIED and CITED hash to the same value, 1049144. My idea was that given a set of letters, I wanted to find all the words containing letters from that set.

I am guessing that I haven't quite understood the concept of hashing (or my code is plain wrong), as I can't quite explain the behaviour I encountered :
I decided to look for all words which consisted of letters from the string "LIVEN". My output (with hash key) was as follows :

VENVILLE,4215328  
LEVIN,4215328  
ENLIVEN,4215328  
CURTSEYED,37486648  

How on earth did CURTSEYED land up there? As can be seen, it has a different hash value from the remaining three words. Where does the fault lie with my understanding/implementation of the hash table?

Code that produced above output:


    typedef std::unordered_set&lt std::string, my_string_hash_function, my_string_equality&gt Dict    
    DictHash dict;       
    DictHash::const_local_iterator c_l_itr;

    DictHash::size_type bs = dict.bucket (std::string ("LIVEN"));
    for (c_l_itr = dict.begin(bs); c_l_itr != dict.end(bs); c_l_itr++)
         std::cout 

My hash function :

struct my_string_hash_function  
{  
    std::size_t operator()(const std::string& s) const  
    {  
        unsigned long hash = 0;  
        std::string::const_iterator itr;

        for (itr = s.begin(); itr != s.end(); itr++)
     hash |= 2 << (*itr - int('A'));

      return hash;
    } 
};

Comparison function :

struct my_string_equality
{
    bool operator()(const std::string& s1, const std::string& s2) const
    {
        if (s1.length() != s2.length())
     return false; 

        unsigned int hash1 = 0, hash2 = 0;
        const char *str1, *str2;
        int i,len;

        len = s1.length();
        str1 = s1.c_str();
        str2 = s2.c_str();

        for (i = 0; i < len; i++)
        {
            hash1 |= 2 << (str1[i] - (int)'A');
            hash2 |= 2 << (str2[i] - (int)'A');
        }

        return hash1 == hash2;
   }
};
2

2 Answers

3
votes

Different hash values will not necessarily end up in different buckets. Generally a hash table will choose a bucket based on hash_value % number_of_buckets, so hashes that are equal modulo the number of buckets will wind up in the same bucket.

Essentially, you cannot guarantee anything about which hash value appears in which bucket.

0
votes

I think you've also got a potential bug in the my_string_equality... don't you just want to use the regular std::string::operator==()? AFAIK you should be doing a comparison of the actual object values, not a comparison of their hash (the container already knows the hash value, it could just call my_string_hash_function and compare the results if that was what it needed to do).