
Updated Question

I know how to use Python to create an MD5 hash from a file (http://docs.python.org/3.5/library/hashlib.html#hash-algorithms). I also know how to read a text file line by line. However, my files can grow large, and it is inefficient to read the file twice from beginning to end. I wonder whether it is possible to read the data only once from disk and, as in a stream/pipe, combine the two tasks intelligently. Maybe something like:

  1. Initialize the MD5 hash
  2. open the file in binary mode
  3. read a chunk of data (e.g. buffer_size=65536) into a buffer
  4. update the MD5 hash with the chunk just read
  5. provide the buffer to another stream to continue processing the data
  6. use TextIOWrapper(?) to read the data again, but this time as text
  7. read the text line by line. When the buffer is consumed, ask the underlying layer for more data, until EOF. It will read more binary data, update the MD5 hash, provide the new buffer ... and I can continue reading line by line (i.e. repeat from step 3 until EOF)
  8. upon EOF, I have processed all my text line by line and have the MD5 hash
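
In rough Python, the intent might look like this sketch (the file name and process() are placeholders for my real data and per-line handling; the line splitting is done by hand here, which is exactly the part I'd like the io layer to take over):

import hashlib

def process(line):
    # placeholder for my real per-line handling
    print("line:", line, end="")

md5 = hashlib.md5()
buffer_size = 65536
pending = b''

with open("some_large_file.txt", "rb") as f:   # file name made up
    while True:
        chunk = f.read(buffer_size)            # step 3: read a binary chunk
        if not chunk:
            break                              # EOF: step 8
        md5.update(chunk)                      # step 4: update the hash
        pending += chunk
        # step 7: split off the complete lines, keep the partial remainder
        *lines, pending = pending.split(b"\n")
        for raw in lines:
            process(raw.decode("utf-8") + "\n")

if pending:
    process(pending.decode("utf-8"))           # last line without trailing newline
print("md5:", md5.hexdigest())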

The objective is to become more efficient by reading the (large) files from disk just once instead of twice, intelligently combining the binary MD5 calculation and the text-based processing of the same file.

I hope this explains it better. Thanks again for your help.

Juergen

I know this page and have read it before. But where exactly does it describe how to use the very same buffer I've read for the MD5 hash to read the text in it line by line? – Juergen
Look, my problem is not MD5. My problem is to a) read buffers from a binary file, b) do something with each buffer, and c) use that buffer (which is bytes, not str) to read the text in it line by line. You don't have a working example by any chance? – Juergen

2 Answers


Yes, just create a single hashlib.md5() object and update it with each chunk:

md5sum = hashlib.md5()

buffer_size = 2048  # 2 KiB; adjust as needed.

with open(..., 'rb') as fileobj:
    # read a binary file in chunks
    for chunk in iter(lambda: fileobj.read(buffer_size), b''):
        # update the hash object
        md5sum.update(chunk)

# produce the final hash digest in hex.
print(md5sum.hexdigest())

If you need to also read the data as text, you'll have to write your own wrapper:

  • either one that implements the TextIOBase API (implement all the stub methods that relate to reading) and draws data from the BufferedReader object produced by the open(..., 'rb') call each time a line is requested. You'll have to do your own line splitting and decoding at that point.

  • or one that implements the BufferedIOBase API (again, implement all the stub methods) that you then pass as the buffer to the TextIOWrapper class; see the sketch below.
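
A minimal sketch of the second option might look like this (HashingReader and the file name are made up for illustration; only the read methods that TextIOWrapper actually uses are implemented):

import hashlib
import io

class HashingReader(io.BufferedIOBase):
    # Wraps a binary file object and hashes everything read through it.
    def __init__(self, raw):
        self.raw = raw
        self.md5 = hashlib.md5()

    def readable(self):
        return True

    def read(self, size=-1):
        data = self.raw.read(size)
        self.md5.update(data)
        return data

    # TextIOWrapper prefers read1() when it is available.
    def read1(self, size=-1):
        return self.read(size)

with open("somefile.txt", "rb") as binary:     # file name made up
    wrapper = HashingReader(binary)
    for line in io.TextIOWrapper(wrapper, encoding="utf-8"):
        pass                                   # process each decoded line here
    print(wrapper.md5.hexdigest())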


This seems to work in Python 3.6:

#!/usr/bin/env python

import io
import hashlib

class MD5Pipe(io.RawIOBase):
    # A raw stream that hashes every chunk read through it.
    def __init__(self, fd):
        self.fd = fd
        self.hasher = hashlib.md5()

    def readable(self):
        return True

    def readinto(self, b):
        # Pull the next chunk from the underlying binary file.
        n = self.fd.readinto(b)
        # Feed exactly the bytes that were read into the hash.
        if n > 0:
            self.hasher.update(b[0:n])
        return n

    def hexdigest(self):
        return self.hasher.hexdigest()

blocksize = 65536
file = "c:/temp/PIL/VTS/VTS_123.csv"
with open(file, "rb") as fd:
    with MD5Pipe(fd) as md5:
        with io.BufferedReader(md5, buffer_size=blocksize) as br:
            with io.TextIOWrapper(br, newline='', encoding="utf-8") as reader:
                for line in reader:
                    print("line: ", line, end="")

                print("md5: ", md5.hexdigest())