
I have a function for calculating the MD5 hashes of all the files on a drive. A hash is calculated, but it differs from the hash I get from other programs or online services designed for that purpose.

def md5_files(path, blocksize = 2**20):
    hasher = hashlib.md5()
    hashes = {}
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            print(file_path)
            with open(file_path, "rb") as f:
                data = f.read(blocksize)
                if not data:
                    break
                hasher.update(data)
                hashes[file_path] = hasher.hexdigest()
    return hashes

The path provided is the drive letter, for example "K:\". I then walk through the files and open each one for binary reading, reading chunks of the size specified in blocksize. I store the file path and MD5 hash of every file in a dictionary called hashes. The code looks okay to me, and I also checked other questions on Stack Overflow, but I don't know why the generated MD5 hash is wrong.

You need to create a new hasher for each file. - Aran-Fey
Can I reuse the same variable? For example, can I put "hasher = hashlib.md5()" in the inner loop, that is, inside the "with" statement? - Fabio

1 Answer


You need to construct a new md5 object for each file and read the file completely, e.g. like so:

import hashlib
import os

def md5_files(path, blocksize=2**20):
    hashes = {}
    for root, dirs, files in os.walk(path):
        for file in files:
            file_path = os.path.join(root, file)
            print(file_path)
            # Create a fresh hasher per file so digests do not accumulate across files.
            hasher = hashlib.md5()
            with open(file_path, "rb") as f:
                # Keep reading until f.read() returns b"" (end of file).
                data = f.read(blocksize)
                while data:
                    hasher.update(data)
                    data = f.read(blocksize)
            hashes[file_path] = hasher.hexdigest()
    return hashes
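
To see why a fresh hasher per file matters, here is a minimal sketch that hashes two small temporary files both ways. The names md5_of_file and demo, and the files a.bin and b.bin, are made up for illustration and are not part of the question's code. With a shared hasher, the second digest should come out as the MD5 of both files' bytes concatenated, which would explain a mismatch with other tools.

```python
import hashlib
import os
import tempfile


def md5_of_file(file_path, blocksize=2**20):
    """Return the MD5 hex digest of one file, read in blocksize chunks."""
    hasher = hashlib.md5()  # fresh hasher, so no state carries over between files
    with open(file_path, "rb") as f:
        # iter() with a sentinel calls f.read(blocksize) until it returns b""
        for chunk in iter(lambda: f.read(blocksize), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def demo():
    """Compare per-file hashing with one hasher shared across two files."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical sample files, created only for this demonstration.
        contents = {"a.bin": b"hello", "b.bin": b"world"}
        paths = []
        for name, data in contents.items():
            file_path = os.path.join(tmp, name)
            with open(file_path, "wb") as f:
                f.write(data)
            paths.append(file_path)

        # Correct approach: a new hasher for every file.
        per_file = [md5_of_file(p, blocksize=2) for p in paths]

        # Problematic approach: one hasher reused for every file.
        shared = hashlib.md5()
        shared_digests = []
        for p in paths:
            with open(p, "rb") as f:
                shared.update(f.read())
            shared_digests.append(shared.hexdigest())
    return per_file, shared_digests
```

Calling demo() returns the two lists of digests. The first entries should agree, since the shared hasher has only seen the first file at that point, while the second entries should differ.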