Updated Question
I know how to use Python to create an MD5 hash from a file http://docs.python.org/3.5/library/hashlib.html#hash-algorithms. I also know how to read a text file line by line. However, my files can grow large, and it is inefficient to read each file twice from beginning to end. I wonder whether it is possible to read the data from disk only once and, as in a stream or pipe, combine the two tasks intelligently. Maybe something like:
- Initialize md5
- open the file in binary mode
- read a chunk of data (e.g. buffer_size=65536) into a buffer
- update the md5 with the chunk just read
- provide the buffer to another stream to continue processing the data
- use TextIOWrapper(?) to read the data again, but this time it is text
- read the text line by line. When the buffer is consumed, ask the underlying layer for more data, until EOF. It will read more binary data, update the md5, provide the new buffer, and I can continue reading line by line (i.e. repeat from the "read a chunk of data" step until EOF)
- upon EOF, I've processed all my text line by line, and have the md5
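
The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution: `HashingReader`, `process_and_hash` and `handle_line` are hypothetical names made up for the example, and UTF-8 is assumed as the text encoding. The idea is a small raw stream that updates the md5 with every chunk it hands upwards, wrapped in the standard `io.BufferedReader` and `io.TextIOWrapper` layers.

```python
import hashlib
import io


class HashingReader(io.RawIOBase):
    """Hypothetical helper: raw stream that hashes every byte read through it."""

    def __init__(self, raw, hasher):
        self._raw = raw          # underlying unbuffered binary file
        self._hasher = hasher    # e.g. hashlib.md5()

    def readable(self):
        return True

    def readinto(self, b):
        # Fill the caller's buffer, then hash exactly the bytes that were read.
        n = self._raw.readinto(b)
        if n:
            self._hasher.update(memoryview(b)[:n])
        return n

    def close(self):
        if not self.closed:
            self._raw.close()
        super().close()


def process_and_hash(path, handle_line, buffer_size=65536, encoding="utf-8"):
    """Read the file once: pass each text line to handle_line, return the md5 hex digest."""
    md5 = hashlib.md5()
    raw = open(path, "rb", buffering=0)  # unbuffered binary file object
    buffered = io.BufferedReader(HashingReader(raw, md5), buffer_size)
    with io.TextIOWrapper(buffered, encoding=encoding) as text:
        for line in text:        # pulls more binary data only when needed
            handle_line(line)
    return md5.hexdigest()
```

Called as `digest = process_and_hash("big.txt", print)`, this would print every line and return the digest afterwards. Note that the digest covers the whole file only if the loop runs to EOF, and that the buffered layer reads ahead, so the hash is not meaningful while iteration is still in progress.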
The objective is to become more efficient by reading the (large) files from disk just once instead of twice, intelligently combining the binary md5 calculation and the text-based processing of the same file.
I hope this explains it better. Thanks again for your help.
Juergen