47
votes

I need to parse thousands of text files (around 3,000 lines each, ~400 KB per file) in a folder. I read them using readlines():

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    if filename.endswith(".gz"):
        f = gzip.open(path, 'rb')
    else:
        f = open(path, 'rb')

    file_content = f.readlines()
    f.close()

    len_file = len(file_content)
    i = 0
    while i < len_file:
        line = file_content[i].split(delimiter)
        ... my logic ...
        i += 1

This works completely fine for a sample of my inputs (50-100 files). When I ran it on the whole input of more than 5K files, the time taken was nowhere close to a linear increase. I did a performance analysis with cProfile. The time taken grows much faster than linearly with the number of files, getting progressively worse as the input approaches 7K files.

Here is the cumulative time taken by readlines: the first row is for 354 files (a sample of the input) and the second is for 7473 files (the whole input):

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    354    0.192    0.001    0.192    0.001 {method 'readlines' of 'file' objects}
   7473 1329.380    0.178 1329.380    0.178 {method 'readlines' of 'file' objects}

Because of this, the total time taken by my code does not scale linearly with the input. I read some documentation notes on readlines(), where people claim that readlines() reads the whole file content into memory and hence generally consumes more memory than readline() or read().

I agree with this point, but shouldn't the garbage collector automatically clear that loaded content from memory at the end of my loop, so that at any instant my memory holds only the contents of my currently processed file? But there seems to be some catch here. Can somebody give some insight into this issue?

Is this an inherent behavior of readlines(), or is my interpretation of the Python garbage collector wrong? Glad to know.

Also, please suggest some alternative ways of doing the same thing in a memory- and time-efficient manner. TIA.

As a side note, there is never a good reason to write len_file = len(file_content), then a while( i < len_file ): loop with i += 1 and file_content[i] inside. Just use for line in file_content:. If you also need i for something else, use for i, line in enumerate(file_content). You're making things harder for yourself and your readers (and for the interpreter, which means your code may run slower, but that's usually much less important here). - abarnert
Thanks @abarnert. I'll change them. - Learner
One last style note: In Python, you can just write if filename.endswith(".gz"):; you don't need parentheses around the condition, and shouldn't use them. One of the great things about Python is how easy it is both to skim quickly and to read in-depth, but putting in those parentheses makes it much harder to skim (because you have to figure out whether there's a multi-line expression, a tuple, a genexp, or just code written by a C/Java/JavaScript programmer). - abarnert
Nice tip, duly noted. Will change them as well. - Learner

2 Answers

94
votes

The short version is: The efficient way to use readlines() is to not use it. Ever.


I read some documentation notes on readlines(), where people claim that readlines() reads the whole file content into memory and hence generally consumes more memory than readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, parses it into lines, and builds a list full of strings out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory, and builds a string, so that doesn't help.


On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.


You can work around this in three ways:

  1. Write a loop around readlines(sizehint), read(size), or readline().
  2. Just use the file as a lazy iterator without calling any of these.
  3. mmap the file, which allows you to treat it as a giant string without first reading it in (there's a sketch of this after the examples below).

For example, this has to read all of foo at once:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

But this only reads about 8K at a time:

with open('foo') as f:
    while True:
        lines = f.readlines(8192)
        if not lines:
            break
        for line in lines:
            pass

And this only reads one line at a time—although Python is allowed to (and will) pick a nice buffer size to make things faster.

with open('foo') as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass

And this will do the exact same thing as the previous:

with open('foo') as f:
    for line in f:
        pass
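
And, for completeness, here's a minimal sketch of option 3 (mmap), assuming Python 3 and a non-empty file; foo is just a stand-in name. The OS pages the mapping in lazily, so you can still walk it line by line without an up-front read:

import mmap

with open('foo', 'rb') as f:
    # length 0 maps the whole file; ACCESS_READ keeps it read-only
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # mmap objects have a readline() that returns b'' at EOF
        for line in iter(mm.readline, b''):
            pass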

Meanwhile:

but shouldn't the garbage collector automatically clear that loaded content from memory at the end of my loop, so that at any instant my memory holds only the contents of my currently processed file?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the freelist, meaning the same memory can be reused again for your next pass.
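
If you want to observe that yourself, here's a quick sketch using tracemalloc (CPython 3.4+; 'foo' is a hypothetical input file standing in for one of yours):

import tracemalloc

tracemalloc.start()

with open('foo', 'rb') as f:
    file_content = f.readlines()        # the whole file is now a list of strings
print(tracemalloc.get_traced_memory())  # (current, peak): both roughly the file size

file_content = None                     # rebinding drops the last reference...
print(tracemalloc.get_traced_memory())  # ...and current usage falls right back down

tracemalloc.stop()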

However, all those allocations, copies, and deallocations aren't free—it's much faster to not do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file, rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).


Putting it all together, here's how I'd write your program:

for filename in os.listdir(input_dir):
    with open(os.path.join(input_dir, filename), 'rb') as f:
        if filename.endswith(".gz"):
            f = gzip.GzipFile(fileobj=f)
        words = (line.split(delimiter) for line in f)
        ... my logic ...

Or, maybe:

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    if filename.endswith(".gz"):
        f = gzip.open(path, 'rb')
    else:
        f = open(path, 'rb')
    with contextlib.closing(f):
        words = (line.split(delimiter) for line in f)
        ... my logic ...

18
votes

Read line by line, not the whole file:

for line in open(file_name, 'rb'):
    # process line here

Even better, use with to automatically close the file:

with open(file_name, 'rb') as f:
    for line in f:
        # process line here

The above will read the file object using an iterator, one line at a time.
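
Since some of the files in the question are gzipped, it's worth noting (as a hedged aside) that gzip file objects support the same line-by-line iteration; a small sketch, assuming file_name points at a .gz file:

import gzip

# gzip file objects are iterable line by line, decompressing as they go
with gzip.open(file_name, 'rb') as f:
    for line in f:
        # process line here
        pass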