2 votes

I'm writing a Python program to find duplicate files. Comparing file size and MD5 is not 100% fool-proof, since it is possible for two different files to have the same size and MD5 digest (the collision probability for a random pair is about 2^-128).
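For a sense of scale, here is a rough birthday-bound estimate of a pair collision among n files (n = one million is just an illustrative assumption):

# Rough birthday-bound estimate: the chance that any two of n files
# share a random 128-bit digest is about n*(n-1)/2 / 2**128.
n = 10 ** 6                     # illustrative assumption: one million files
p = n * (n - 1) / 2 / 2 ** 128
print(f"{p:.3e}")               # prints roughly 1.469e-27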

I was wondering, then: if I add another hash such as SHA1 (2^160) or CRC32 (2^32) to the mix, would that greatly improve the ability to identify unique files, i.e. comparing both the MD5 and SHA1 of a file for uniqueness? Is SHA1 or CRC32 preferable for this secondary check?

If so, how do I calculate both MD5 and SHA1/CRC32 simultaneously while iterating through 1 MB blocks of a very large file, to avoid reading the file twice? This is what I have for MD5:

import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read in 1 MB chunks so large files never load fully into memory
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Currently my code, after finding identical MD5s, runs filecmp.cmp on both files as the final check. This is obviously resource-intensive and inefficient.
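For reference, a minimal sketch of that pipeline (group by size first, then by MD5, then confirm with filecmp.cmp; the function name and structure are just illustrative):

import filecmp
import os
from collections import defaultdict

def find_duplicates(paths):
    # Pass 1: only files with equal sizes can be duplicates
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Pass 2: group the size-matched files by MD5 (md5() defined above)
        by_md5 = defaultdict(list)
        for path in same_size:
            by_md5[md5(path)].append(path)
        # Pass 3: confirm hash-matched candidates byte-for-byte
        for candidates in by_md5.values():
            first = candidates[0]
            for other in candidates[1:]:
                if filecmp.cmp(first, other, shallow=False):
                    yield first, other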

I'm using SQLite to store the hashes. I assume it is slower than in-memory Python lists, but it won't run out of memory when processing billions of files.
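A minimal sketch of how the hashes could be stored and duplicate candidates queried (the schema, file name, and helper are assumptions for illustration):

import sqlite3

conn = sqlite3.connect("hashes.db")  # illustrative database name
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,
        size INTEGER,
        md5  TEXT,
        sha1 TEXT
    )
""")

def record(path, size, md5_hex, sha1_hex):
    # INSERT OR REPLACE keeps rows current when re-scanning the same paths
    conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                 (path, size, md5_hex, sha1_hex))
    conn.commit()

# Paths sharing size and both digests are duplicate candidates
dupes = conn.execute("""
    SELECT size, md5, sha1, GROUP_CONCAT(path)
    FROM files
    GROUP BY size, md5, sha1
    HAVING COUNT(*) > 1
""").fetchall()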

2^128 is misleading, but the actual probability of a collision is small enough that you don't need to worry; MD5 is fine (for non-sensitive applications). stackoverflow.com/questions/4032209/… / crypto.stackexchange.com/questions/12677/… — Alex K.
Thanks for the comment. A "small enough" probability won't be enough for someone with 500+ GB of photo images to sort through and eliminate duplicates. Wondering if SHA1/CRC32 is enough to "guarantee" that. Perhaps filecmp.cmp is the only fool-proof way... — ebann
Using two hash functions should give you enough certainty to sleep peacefully ;) — Maciek

1 Answer

4 votes

You already have the hardest part done. You just have to feed the chunks you read to another hasher:

import hashlib

def calculate_hashes(fname):
    hash_md5 = hashlib.md5()
    hash_sha1 = hashlib.sha1()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            # Each 1 MB chunk is read once and fed to both hashers
            hash_md5.update(chunk)
            hash_sha1.update(chunk)
    return hash_md5.hexdigest(), hash_sha1.hexdigest()
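A quick usage check (the file names here are hypothetical):

md5_a, sha1_a = calculate_hashes("photo_a.jpg")
md5_b, sha1_b = calculate_hashes("photo_b.jpg")
if (md5_a, sha1_a) == (md5_b, sha1_b):
    print("almost certainly duplicates")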