4
votes

Let's say you are planning to design a hash function which will generate keys between 0-256. Will using first 2 digits of MD5-digest be a great idea for a uniform distribution? What do you think on this? Is it expensive to md5() some word (2-10 letters)?

I know it is a rough definition of requirements but it would be great to discuss this.

4
I imagine there is no guarantee that a subset of an MD5 hash has a uniform distribution (similar to how GUIDs work). - David
Considering a multi-megabyte file can be MD5'ed like, well, "instantly" on modern hardware... however, with such a short input, uhh.. hmm. - user166390
Out of curiosity, why are you hashing two character strings to two character hashes? - user229044
I agree with David. You're probably best off just writing up a quick test app that runs your design several thousand times so you can get an idea of the cost and statistical distribution. - Spencer Hakim
If you are looking for a 1 byte hash. Perhaps CRC8 will work better. - tidwall

4 Answers

4
votes

There's no reason to use a cryptographic strength hash for something as simple as generating 3 digit hashes. You're better off using a more simple hash there.

I'm not certain specifically how expensive MD5 is relative to others, but there are plenty of better ways to create a small hash (see this article for some algorithm ideas).

3
votes
1
votes

MD5 is designed to uniformaly spread the input over all the output bytes so it's as good as any other general hash function - sounds like a bit of overkill if you only want 256 values.

Note the output of MD5 is 128bytes (16bytes), it's only the text representation that is hex digits - so there is really no first two digits of MD5 - just use the bottom 8bits.

0
votes

You haven't explained how you're going to use the hash, and what you're going to do with the collisions that are inevitable given that you have only 256 output values.

I think even MD5 (which is not cryptographically secure any more) is overkill for the likely applications.

I'd probably go with a CRC (cyclic redundancy check) algorithm that would generate a 16-bit or 32-bit number for you, and would likely give you good enough distribution.