9
votes

I have a client who is distributing large binary files internally. They are also passing MD5 checksums of the files and apparently verifying the files against the checksums before use as part of their workflow.

However, they claim that "often" they are encountering corruption in the files where the MD5 is still saying that the file is good.

Everything I've read suggests that this should be hugely unlikely.

Does this sound likely? Would another hashing algorithm provide better results? Should I actually be looking at process problems such as them claiming to check the checksum, but not really doing it?

NB, I don't yet know what "often" means in this context. They are processing hundreds of files a day. I don't know if this is a daily, monthly or yearly occurrence.

6
Have them provide an example of a "corrupt" file and the "good" original. – Anon.
Is it possible that the md5 sum was computed on a corrupt file, or that the corruption is occurring after the sum was computed? To know for sure, take Anon's suggestion and get an example of two files with the same checksum. – BMitch
So since then, have you looked at the BitTorrent Sync idea? getsync.com – dlamblin

6 Answers

13
votes

MD5 is a 128-bit cryptographic hash function, so different messages should be distributed pretty well over the 128-bit space. That means two files (excluding files specifically built to defeat MD5) should have about a 1 in 2^128 chance of colliding. In other words, even if a pair of files had been compared every nanosecond since the universe began, you still wouldn't expect to have seen a collision yet.
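
To make that last sentence concrete, here is a quick back-of-the-envelope script (plain Python, no real files involved) estimating how long one comparison per nanosecond would take before a 1-in-2^128 event becomes expected:

```
# Back-of-envelope check: comparing one pair of files per nanosecond,
# how long until a 1-in-2^128 collision becomes likely?
SPACE = 2 ** 128                      # number of possible MD5 digests
PAIRS_PER_SECOND = 1_000_000_000      # one comparison per nanosecond
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years = SPACE / (PAIRS_PER_SECOND * SECONDS_PER_YEAR)
print(f"Expected wait for one collision: ~{years:.2e} years")
# ~1.08e+22 years -- hundreds of billions of times the age of the
# universe (~1.4e10 years).
```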

6
votes

If a file is corrupted, then the probability that the corrupted file has the same MD5 checksum as the uncorrupted file is 1 in 2^128. In other words, it will happen almost as "often" as never. It is astronomically more likely that your client is misreporting what really happened (e.g. they are computing the wrong hash).
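
To put it another way: any corruption at all, even a single flipped bit, changes the digest. A quick self-contained demonstration (synthetic in-memory data, no real files):

```
# Flip a single bit of a copy of the data and compare MD5 digests.
import hashlib

original = b"some large binary payload" * 1000
corrupted = bytearray(original)
corrupted[12345] ^= 0x01                 # simulate a one-bit corruption

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(bytes(corrupted)).hexdigest())
print("digests match:",
      hashlib.md5(original).digest() == hashlib.md5(bytes(corrupted)).digest())
```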

6
votes

Sounds like a bug in their use of MD5 (maybe they are MD5-ing the wrong files), or a bug in the library that they're using. For example, an older MD5 program that I used once didn't handle files over 2GB.
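
If the tooling is suspect, it is easy to cross-check with a small script. A minimal sketch using Python's hashlib that streams the file in chunks, so file size is not an issue (the file name and expected digest below are placeholders):

```
# Streaming MD5 check that avoids "whole file in memory" problems with
# multi-gigabyte files. Path and expected digest are illustrative only.
import hashlib

def md5_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 of a file by reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "d41d8cd98f00b204e9800998ecf8427e"   # example digest (empty file)
if md5_file("big_binary.bin") != expected:      # hypothetical file name
    raise SystemExit("checksum mismatch: file is corrupt or digest is stale")
```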

This question suggests that, on average, you would get a collision every 100 years if you were generating 6 billion files per second, so it's quite unlikely.

4
votes

Does this sound likely?

No, the chance of a random corruption producing the same checksum is 1 in 2^128, or about 1 in 3.40 × 10^38. This number puts a one-in-a-billion (10^9) chance to shame.

Would another hashing algorithm provide better results?

Probably not. While MD5's collision resistance has been broken against deliberate attack, it is still fine against random corruption, and it remains a popular standard to use.
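
If they ever do want to switch anyway (say to SHA-256), it's worth noting that with typical scripted tooling the change is tiny. A minimal sketch assuming Python's hashlib, with placeholder data:

```
# The algorithm name is the only thing that varies.
import hashlib

data = b"the same bytes, hashed two ways"   # stand-in for real file contents

for name in ("md5", "sha256"):
    digest = hashlib.new(name, data).hexdigest()
    print(f"{name:>6}: {digest}")
```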

Should I actually be looking at process problems such as them claiming to check the checksum, but not really doing it?

Probably, but consider all possible points of problems:

  1. File corrupted before MD5 generation
  2. File corrupted after MD5 verification
  3. MD5 program or supporting framework has a bug
  4. Operator misuse (unintentional, e.g. running the MD5 program on the wrong file)
  5. Operator abuse (intentional, e.g. skipping the verification step)

If it is the last, then one final thought is to distribute the files in a wrapper format that forces the operator to unwrap the file, where the unwrapping performs the verification as part of extraction. I'm thinking of something like Gzip or 7-Zip, which handle large files, possibly with compression turned off (I don't know offhand whether they allow that). A rough sketch of the idea follows.
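
Here is what such a wrapper could look like, using a home-grown format purely for illustration (in practice an existing container such as gzip, whose CRC-32 is checked on decompression, may be preferable). wrap() prepends the MD5 digest and unwrap() refuses to extract on a mismatch:

```
# Hypothetical wrapper format: <16-byte MD5 digest><payload>.
# The digest travels inside the file, and unwrapping *is* the verification
# step, so the operator cannot skip it.
import hashlib

CHUNK = 1 << 20  # 1 MiB

def wrap(src: str, dst: str) -> None:
    """Write the payload's MD5 digest followed by the payload itself."""
    h = hashlib.md5()
    with open(src, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    with open(src, "rb") as f, open(dst, "wb") as out:
        out.write(h.digest())                     # 16-byte header
        for block in iter(lambda: f.read(CHUNK), b""):
            out.write(block)

def unwrap(src: str, dst: str) -> None:
    """Extract the payload only if its MD5 matches the stored digest."""
    h = hashlib.md5()
    with open(src, "rb") as f:                    # pass 1: verify
        expected = f.read(16)
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    if h.digest() != expected:
        raise ValueError("corrupt file: MD5 mismatch, refusing to extract")
    with open(src, "rb") as f, open(dst, "wb") as out:   # pass 2: extract
        f.seek(16)
        for block in iter(lambda: f.read(CHUNK), b""):
            out.write(block)
```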

0
votes

There are all sorts of reasons that binaries either won't get distributed or, if they do, arrive corrupted (firewalls, size limits, virus insertion, etc.). You should always encrypt files when sending binaries (even low-grade encryption is better than none) to help protect data integrity.
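
For the integrity benefit specifically, it is the authenticated part of the encryption that does the work. A minimal sketch assuming the third-party cryptography package (pip install cryptography); the payload and the corrupted byte position are illustrative:

```
# Fernet is authenticated encryption: any corruption of the token makes
# decryption fail loudly instead of silently producing bad data.
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()          # would be shared out of band
fernet = Fernet(key)

payload = b"stand-in for the real binary file contents"
token = fernet.encrypt(payload)

damaged = bytearray(token)
damaged[20] ^= 0x01                  # simulate corruption in transit

try:
    fernet.decrypt(bytes(damaged))
except InvalidToken:
    print("corruption detected: refusing to decrypt")
```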

0
votes

Couldn't resist a back-of-envelope calculation:

There are 2^128 possible MD5 hashes, or roughly 3.4 x 10^38 (that is, odds of 340 billion billion billion billion to 1 against). Let's call this number 'M'.

The probability that the Kth hash does not match any of the earlier ones, given that the 1st to (K-1)th hashes were all distinct, is (1 - (K-1)/M), since K-1 of the M possible values are already taken.

And P(no duplicate in N file hashes) = Product[k = 1...N] (1 - (k-1)/M). When N^2 <<< M, this approximates to 1 - (1/2)N^2/M, so P(one or more duplicates) ≈ (1/2)N^2/M, where (1/2)N^2 approximates the number of pairwise comparisons of hashes that have to be made.

So let's say we take a photograph of EVERYONE ON THE PLANET (7.8 billion people, or a little under 2^33); then there are about 30.4 billion billion pairwise comparisons to make (a little under 2^65).

This means that the chance of a matching MD5 hash (assuming a perfectly even distribution) is still 2^65/2^128 = 2^-63, or roughly 1 in 10,000,000,000,000,000,000.
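
For anyone who wants to check the arithmetic rather than trust the envelope, the whole thing fits in a few lines of plain Python using the same N^2/(2M) birthday approximation:

```
# Reproducing the back-of-envelope numbers above.
M = 2 ** 128                 # number of possible MD5 digests
N = 7_800_000_000            # "a photograph of everyone on the planet"

pairs = N * N // 2           # ~3.04e19 pairwise comparisons
p_collision = pairs / M      # birthday approximation: N^2 / (2M)

print(f"pairwise comparisons: {pairs:.2e}")               # ~3.04e+19
print(f"P(at least one collision) ~ {p_collision:.2e}")   # ~8.9e-20, i.e. ~1 in 1.1e19
```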

MD5 is a pretty decent hash function for non-hostile environments, which means a false match is far less likely for your client than, say, the chance of their CEO going crazy and burning down the data centre, let alone the things they actually worry about.