6
votes

Non-cryptographic hashes such as MurmurHash3 and xxHash are almost exclusively designed for hash tables, but they appear to function comparably (and even favorably) to CRC-32, Adler-32 and Fletcher-32. Non-crypto hashes are often faster than CRC-32 and produce more "random" output similar to slow cryptographic hashes (MD5, SHA). Despite this, I only ever see CRC-32 or MD5 recommended for data integrity/checksum purposes.

In the table below, I tested 32-bit checksum/CRC/hash functions to determine how well they detect small differences in data:

Table

The results in each cell means: A) number of collisions found, and B) minimum and maximum probability that any of the 32 output bits are set to 1. To pass test B, the max and min should be as close as possible to 50. Anything under 45 or over 55 indicates bias.


Looking at the table, MurmurHash3 and Jenkins lookup2 compare favorably to CRC-32 (which actually fails one test). They are also well-distributed. DJB2 and FNV1a pass collision tests but aren't well distributed. Fletcher32 and Adler32 struggle with the NullBytes and 8RandBytes tests.

So then my question is, compared to other checksums, how suitable are 'non-cryptographic hashes' for detecting errors or differences in files? Is there any reason a CRC-32/Adler-32/CRC-64 might outperform any decent 32-bit/64-bit hash?

1
For error detection, you want a high likelihood that flipping any bit in the input produces a different output. Ideally, you want comparably high likelihood for combinations of two or more bit flips. The tests you've performed do not seem to address that. The per-bit probability of that bit being 1 in the result ignores the possibility of correlations.John Bollinger
Concerning "can't we design anything better/faster?" is simphash() faster?chux - Reinstate Monica
For example, consider this algorithm: (1) set the result to zero; (2) for each byte of input, if the parity of the most-significant bit is the same as the parity of the byte's position in the input, then flip all the bits of the result. If I understand the tests correctly, that will produce near-ideal results in all of them (and it could be made very fast!), but it would be extremely ineffective for error detection.John Bollinger
Your Bytes1to255 test tickles a particular property of a CRC whose register is initialized to all 1's and is then fed all 1's, i.e. your sequence of 255's. The collisions you count are not representative of the average behavior of the CRC. The 254 collisions occur after only five bytes (you don't need to go out to 72 to see them). All of the collisions are of the form CRC(3*255+n) == CRC(4*255+~n), where by "*" I mean repeat that many, and by "+" I mean concatenate. "~" means bit-wise inverse.Mark Adler

1 Answers

3
votes

Is there any reason this function would be inferior to CRC-32 or Adler-32 for detecting errors in data?

Yes, for certain kinds of error characteristics. A CRC can be designed to very effectively detect small numbers of bit errors in a packet, as you might expect on an actual communications or storage channel. That's what it's designed for.

For large numbers of errors, any 32-bit check that fills the 32 bits and does a reasonably good job of being sensitive to all of the bits of the packet will work about as well as any other. So your's would be as good as a CRC-32, and a smidge better than an Adler-32. (The Adler-32 deliberately does not use all possible 32-bit values, so has a slightly higher false positive rate than 32-bit checks that use all possible values.)

By the way, looking a little more at your algorithm, it does not distribute over all 32-bit values until you have many bytes of input. So your check would not be as good as any other 32-bit check on a large number of errors until you have covered the possible 32-bit values of the check.