md5 hash of file calculated not correct in Python

Question

I have a function for calculating the md5 hashes of all the files in a drive. A hash is calculated but it's different from the hash I got using other programs or online services that are designed for that.

def md5_files(path, blocksize = 2**20):
    hasher = hashlib.md5()
    hashes = {}
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            print(file_path)
            with open(file_path, "rb") as f:
                data = f.read(blocksize)
                if not data:
                    break
                hasher.update(data)
                hashes[file_path] = hasher.hexdigest()
    return hashes

The path provided is the drive letter, for example "K:\". I walk through the files and open each one for binary reading, reading chunks of the size specified in blocksize. Then I store the file path and MD5 hash of every file in a dictionary called hashes. The code looks okay to me, and I also checked other questions on Stack Overflow, but I don't know why the generated MD5 hash is wrong.

Can I reuse the same variable? For example can I put "hasher = hashlib.md5()" in the inner loop? I mean, inside the "with" statement — Fabio

janbrohl · Accepted Answer · 2016-08-06T15:58:02

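You need to construct a new md5 object for each file and read the file completely. In your code the single hasher keeps accumulating data from every file it has seen, and only the first block of each file is read. For example, like so: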

import hashlib
import os

def md5_files(path, blocksize = 2**20):
    hashes = {}
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            print(file_path)
            with open(file_path, "rb") as f:
                # Start a fresh md5 object for each file, seeded with the first block
                data = f.read(blocksize)
                hasher = hashlib.md5(data)
                # Keep reading until read() returns b"" at end of file
                while data:
                    data = f.read(blocksize)
                    hasher.update(data)
                hashes[file_path] = hasher.hexdigest()
    return hashes
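
As a variation on the same idea, here is a minimal sketch that moves the per-file work into its own helper, so the hasher cannot leak state between files. The names file_md5 and md5_tree are made up for this illustration and are not part of hashlib. The last few lines show why a shared hasher produces unexpected digests: update() accumulates, so the result is the hash of everything fed in so far.

```python
import hashlib
import os


def file_md5(file_path, blocksize=2**20):
    """Return the hex MD5 digest of one file, read in blocks (hypothetical helper)."""
    hasher = hashlib.md5()  # fresh object per call, so nothing carries over
    with open(file_path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


def md5_tree(path, blocksize=2**20):
    """Map every file path under `path` to its MD5 hex digest (hypothetical helper)."""
    hashes = {}
    for root, dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            hashes[file_path] = file_md5(file_path, blocksize)
    return hashes


# A hasher shared across inputs hashes their concatenation,
# not the most recent input on its own.
shared = hashlib.md5()
shared.update(b"first file")
shared.update(b"second file")
assert shared.hexdigest() == hashlib.md5(b"first filesecond file").hexdigest()
assert shared.hexdigest() != hashlib.md5(b"second file").hexdigest()
```

Calling md5_tree("K:\\") would then return the same kind of dictionary as md5_files above. Unreadable files (permissions, locks) would raise from open(), so on a whole drive you may want to wrap the file_md5 call in a try/except OSError.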
