6
votes

Is there a way to generate a hash of a string so that the hash itself would be of specific length? I've got a function that generates 41-byte hashes (SHA-1), but I need it to be 33-bytes max (because of certain hardware limitations). If I truncate the 41-byte hash to 33, I'd probably (certainly!) lost the uniqueness.

Or actually I suppose an MD5 algorithm would fit nicely, if I could find some C code for one with your help.

EDIT: Thank you all for the quick and knowledgeable responses. I've chosen to go with an MD5 hash and it fits fine for my purpose. The uniqueness is an important issue, but I don't expect the number of those hashes to be very large at any given time - these hashes represent software servers on a home LAN, so at max there would be 5, maybe 10 running.

9

9 Answers

7
votes

If I truncate the 41-byte hash to 33, I'd probably (certainly!) lost the uniqueness.

What makes you think you've got uniqueness now? Yes, there's clearly a higher chance of collision when you're only playing with 33 bytes instead of 41, but you need to be fully aware that collisions are only ever unlikely, not impossible, for any situation where it makes sense to use a hash in the first place. If you're hashing more than 41 bytes of data, there are clearly more possible combinations than there are hashes available.

Now, whether you'd be better off truncating the SHA-1 hash or using a shorter hash such as MD5, I don't know. I think I'd be more generally confident when keeping the whole of a hash, but MD5 has known vulnerabilities which may or may not be a problem for your particular application.

5
votes

The way hashes are calculated that's unfortunately not possible. To limit the hash length to 33 bytes, you will have to cut it. You could xor the first and last 33 bytes, as that might keep more of the information. But even with 33 bytes you don't have that big a chance of a collision.

md5: http://www.md5hashing.com/c++/

btw. md5 is 16 bytes, sha1 20 bytes and sha256 is 32 bytes, however as hexstrings, they all double in size. If you can store bytes, you can even use sha256.

4
votes

There is no more chance of collision with substring(sha_hash, 0, 33) than with any other hash that is 33 bytes long, due to the way hash algorithms are designed (entropy is evenly spread out in the resulting string).

3
votes

You could use an Elf hash(<- C code included) or some other simple hash function like that instead of MD5 or SHA-X. They are not secure, but they can be tuned to any length you need

/*****Please include following header files*****/
// string
/***********************************************/

/*****Please use following namespaces*****/
// std
/*****************************************/

static unsigned int ELFHash(string str) {
    unsigned int hash = 0;
    unsigned int x = 0;
    unsigned int i = 0;
    unsigned int len = str.length();

    for (i = 0; i < len; i++)
    {
        hash = (hash << 4) + (str[i]);
        if ((x = hash & 0xF0000000) != 0)
        {
            hash ^= (x >> 24);
        }
        hash &= ~x;
    }

    return hash;
}

Example

string data = "jdfgsdhfsdfsd 6445dsfsd7fg/*/+bfjsdgf%$^";
unsigned int value = ELFHash(data);

Output

248446350
2
votes

Hashes are by definition only unique for small amount of data (and even then it's still not guaranteed). It is impossible to map a large amount of information uniquely to a small amount of information by virtue of the fact that you can't mmagically get rid of information and get it back later. Keep in mind this isn't compression going on.

Personally, I'd use MD5 (if you need to store in text), or a 256b (32B) hash such as SHA256 (if you can store in binary) in this situation. Truncating another hash algorithm to 33B works too, and MAY increase the possibility of generating hash collisions. It depends alot on the algorithm.

Also, yet another C implementation of MD5, by the people who designed it.

1
votes

I believe the MD5 hashing algorithm results in a 32 digit number so maybe that one will be more suitable.

Edit: to access MD5 functionality, it should be possible to hook into the openssl libraries. However you mentioned hardware limitations so this may no be possible in your case.

1
votes

Here is an MD5 implementation in C.

1
votes

The chance of a 33-byte collision is 1/2^132 (by the Birthday Paradox)

So don't worry about losing uniqueness.

Update: I didn't check the actual byte length of SHA1. Here's the relevant calculation: a 32-nibble collision (33 bytes of hex - 1 termination char), occurs only when the number of strings hashed becomes around sqrt(2^(32*4)) = 2^64.