Can CRC32 be used as a hash function? Any drawbacks to this approach? Any tradedeoffs?
3 Answers
CRC32 works very well as a hash algorithm. The whole point of a CRC is to hash a stream of bytes with as few collisions as possible. That said, there are a few points to consider:
CRC's are not secure. For secure hashing you need a much more computationally expensive algorithm. For a simple bucket hasher, security is usually a non-issue.
Different CRC flavors exist with different properties. Make sure you use the right algorithm, e.g. with hash polynomial 0x11EDC6F41 (CRC32C) which is the optimal general purpose choice.
As a hashing speed/quality trade-off, the x86 CRC32 instruction is tough to beat. However, this instruction doesn't exist in older CPU's so beware of portability problems.
---- EDIT ----
Mark Adler provided a link to a useful article for hash evaluation by Bret Mulvey. Using the source code provided in the article, I ran the "bucket test" for both CRC32C and Jenkins96. These tables show the probability that a truly uniform distribution would be worse than the measured result by chance alone. So, higher numbers are better. The author considered 0.05 or lower to be weak and 0.01 or lower to be very weak. I'm entirely trusting the author on all this and am just reporting results.
I placed an * by all the instances where CRC32C performed better than Jenkins96. By this simple tally, CRC32C was a more uniform hash than Jenkins96 54 of 96 times. Especially if you can use the x86 CRC32 instruction, the speed performance trade-off is excellent.
CRC32C (0x1EDC6F41) Uniform keys Text keys Sparse keys Bits Lower Upper Lower Upper Lower Upper 1 0.671 *0.671 *1.000 0.120 *0.572 *0.572 2 *0.706 *0.165 *0.729 *0.919 0.277 0.440 3 *0.878 *0.879 *0.556 0.362 *0.535 *0.542 4 0.573 0.332 0.433 0.462 *0.855 0.393 5 0.023 *0.681 0.470 0.907 0.266 0.059 6 *0.145 *0.523 0.354 *0.172 *0.336 0.588 7 0.424 0.722 0.172 *0.736 0.184 *0.842 8 *0.767 0.507 *0.533 0.437 0.337 0.321 9 0.480 0.725 *0.753 *0.807 *0.618 0.025 10 *0.719 0.161 *0.970 *0.740 *0.789 0.344 11 *0.610 0.225 *0.849 *0.814 *0.854 *0.003 12 *0.979 *0.239 *0.709 0.786 0.171 *0.865 13 *0.515 0.395 0.192 0.600 0.869 *0.238 14 0.089 *0.609 0.055 *0.414 *0.286 *0.398 15 *0.372 *0.719 *0.944 0.100 *0.852 *0.300 16 0.015 *0.946 *0.467 0.459 0.372 *0.793
And for Jenkins96, which the author of article considered to be an excellent hash:
Jenkins96 Uniform keys Text keys Sparse keys Bits Lower Upper Lower Upper Lower Upper 1 0.888 0.572 0.090 0.322 0.090 0.203 2 0.198 0.027 0.505 0.447 0.729 0.825 3 0.444 0.510 0.360 0.444 0.467 0.540 4 0.974 0.783 0.724 0.971 0.439 0.902 5 0.308 0.383 0.686 0.940 0.424 0.119 6 0.138 0.505 0.907 0.103 0.300 0.891 7 0.710 0.956 0.202 0.407 0.792 0.506 8 0.031 0.552 0.229 0.573 0.407 0.688 9 0.682 0.990 0.276 0.075 0.269 0.543 10 0.382 0.933 0.038 0.559 0.746 0.511 11 0.043 0.918 0.101 0.290 0.584 0.822 12 0.895 0.036 0.207 0.966 0.486 0.533 13 0.290 0.872 0.902 0.934 0.877 0.155 14 0.859 0.568 0.428 0.027 0.136 0.265 15 0.290 0.420 0.915 0.465 0.532 0.059 16 0.155 0.922 0.036 0.577 0.545 0.336
I don't know why Mark Adler said that "crc32 poorly distributes the input bits to hash". There is no single bit in the crc32 hash that is exactly equal to the input bits. Any bit of the hash is a linear combination of the input bits. Secondly, crc always evenly map the same number of different of input sequences to a given hash value. For example, if you have 1000 bits long message, after crc32, you can always find 2^(1000-32) sequences that produce a given hash value, no more, no less.
If you do not need the security feature, crc can serve as hash perfectly.
Actually, I think other non-secure hash functions may be simpler than crc, if you need to have a longer crc, for example crc-256.
CRC32 maps bytes to 32-bit integers, before accumulating them with xor. That means each byte affects only 8 out of 32 bits in your hash. Of course CRC32 does shifting too, but it only hides the problem under the rug. I.e. it will distribute keys unevenly, there will be heavy clustering at some region. It may appear such hash works fine, until you hit that region, and suddenly your O(1) hash table turns into the O(n) one.
CRC32 was designed for detecting damaged files, not hashing. And as Mark mentioned it wont protect your files from modification, since hackers can still modify them at will by just inserting a properly crafted 32bit value after the change.