I have to open several thousand files, but only read the first 3 lines of each.

Currently, I am doing this:

def test_readline(filename):
    # Read only the first three lines; the with-block closes the file.
    with open(filename, 'rb') as fid:
        lines = [fid.readline() for _ in range(3)]
    return lines

Timing this with %timeit yields:

The slowest run took 10.20 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 59.2 µs per loop

An alternative would be to convert the file object to a list:

def test_list(filename):
    # list(fid) reads every line in the file, not just the first three.
    with open(filename, 'rb') as fid:
        lines = list(fid)
    return lines

%timeit test_list(MYFILE)

The slowest run took 4.92 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 374 µs per loop

Yikes!! Is there a faster way to only read the first 3 lines of these files, or is readline() the best? Can you respond with alternatives and timings please?

But at the end of the day I have to open thousands of individual files, and they will not be cached. So does it even matter? It looks like it does:

(Uncached: 603 µs for the readline method vs. 1840 µs for the list method.)

Additionally, here is the readlines() method:

def test_readlines(filename):
    # readlines() with no argument also reads the whole file.
    with open(filename, 'rb') as fid:
        lines = fid.readlines()
    return lines

The slowest run took 7.17 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 334 µs per loop

600µs for 1000 files is still just 0.6 seconds. Not bad for operating on 1000 files, I'd say. "Faster" is fine, but at what point is it too slow? How often do you have to do this and how fast does it need to be? - deceze

If you know that the first three lines won’t ever exceed a certain size and you’re okay with overshooting, .readlines() also accepts a parameter with a maximum number of bytes or characters to read. It's a little weird for most situations, though. - Ry-

600µs is per file, so it does add up. And I am doing a lot of other things later in the code. Every bit helps, and I am trying to optimize. - name goes here

@Ryan You should add that as an answer instead of a comment and I can time it. But from what I know, readlines() would read the entire file first. - name goes here

@evanleeturner: It’s conditional on something you haven’t answered. Do the first three lines have a hard size limit? - Ry-

1 Answer

You can slice an iterable with itertools.islice:

import itertools


def test_list(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return list(itertools.islice(f, 3))

(I changed the open up a bit because it’s slightly unusual to read files in binary mode by line, but you can revert that.)
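
Since timings were requested, here is a rough sketch for comparing the approaches on your own files. It is not a benchmark result: the helper names (head_readline, head_islice, head_readlines_hint, time_all) and the 256-byte hint are placeholders I made up for illustration. The readlines(hint) variant only returns all three lines if the hint is at least as large as the first three lines combined, which is an assumption you would have to confirm for your data.

```python
import itertools
import os
import tempfile
import timeit


def head_readline(filename, n=3):
    # Call readline() n times (yields b'' entries if the file is shorter).
    with open(filename, 'rb') as f:
        return [f.readline() for _ in range(n)]


def head_islice(filename, n=3):
    # Take at most n lines from the file iterator, then stop.
    with open(filename, 'rb') as f:
        return list(itertools.islice(f, n))


def head_readlines_hint(filename, n=3, hint=256):
    # readlines(hint) stops once about `hint` bytes of lines have been read.
    # ASSUMPTION: hint >= total size of the first n lines; otherwise fewer
    # than n lines may come back. It can overshoot, so slice the result.
    with open(filename, 'rb') as f:
        return f.readlines(hint)[:n]


def time_all(filename, number=1000):
    # Average seconds per call for each approach. Results depend on the
    # machine and on whether the OS already has the file cached.
    funcs = (head_readline, head_islice, head_readlines_hint)
    return {
        func.__name__: timeit.timeit(lambda: func(filename), number=number) / number
        for func in funcs
    }


if __name__ == '__main__':
    # Hypothetical sample file, only so the sketch runs on its own.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as out:
        out.write(b''.join(b'line %d\n' % i for i in range(1000)))
    try:
        for name, seconds in time_all(path).items():
            print('%s: %.1f µs per call' % (name, seconds * 1e6))
    finally:
        os.remove(path)
```

Repeated runs like this measure the cached case; for thousands of distinct, uncached files the disk access will likely dominate, so the gap between the three-line approaches may shrink.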