72
votes

For example, files, in Python, are iterable - they iterate over the lines in the file. I want to count the number of lines.

One quick way is to do this:

lines = len(list(open(fname)))

However, this loads the whole file into memory (at once). This rather defeats the purpose of an iterator (which only needs to keep the current line in memory).

This doesn't work:

lines = len(line for line in open(fname))

as generators don't have a length.

Is there any way to do this short of defining a count function?

def count(i):
    c = 0
    for el in i: c += 1
    return c

To clarify, I understand that the whole file will have to be read! I just don't want it in memory all at once

10
to count the number of lines you will load the file in memory anyway!hasen
lists (all sequence types) are iterables as well. what you mean is "iterator"user3850
@hasen: yes, but not all at once.Claudiu

10 Answers

84
votes

Short of iterating through the iterable and counting the number of iterations, no. That's what makes it an iterable and not a list. This isn't really even a python-specific problem. Look at the classic linked-list data structure. Finding the length is an O(n) operation that involves iterating the whole list to find the number of elements.

As mcrute mentioned above, you can probably reduce your function to:

def count_iterable(i):
    return sum(1 for e in i)

Of course, if you're defining your own iterable object you can always implement __len__ yourself and keep an element count somewhere.

23
votes

If you need a count of lines you can do this, I don't know of any better way to do it:

line_count = sum(1 for line in open("yourfile.txt"))
13
votes

The cardinality package provides an efficient count() function and some related functions to count and check the size of any iterable: http://cardinality.readthedocs.org/

import cardinality

it = some_iterable(...)
print(cardinality.count(it))

Internally it uses enumerate() and collections.deque() to move all the actual looping and counting logic to the C level, resulting in a considerable speedup over for loops in Python.

11
votes

I've used this redefinition for some time now:

def len(thingy):
    try:
        return thingy.__len__()
    except AttributeError:
        return sum(1 for item in iter(thingy))
8
votes

It turns out there is an implemented solution for this common problem. Consider using the ilen() function from more_itertools.

more_itertools.ilen(iterable)

An example of printing a number of lines in a file (we use the with statement to safely handle closing files):

# Example
import more_itertools

with open("foo.py", "r+") as f:
    print(more_itertools.ilen(f))

# Output: 433

This example returns the same result as solutions presented earlier for totaling lines in a file:

# Equivalent code
with open("foo.py", "r+") as f:
    print(sum(1 for line in f))

# Output: 433
7
votes

Absolutely not, for the simple reason that iterables are not guaranteed to be finite.

Consider this perfectly legal generator function:

def forever():
    while True:
        yield "I will run forever"

Attempting to calculate the length of this function with len([x for x in forever()]) will clearly not work.

As you noted, much of the purpose of iterators/generators is to be able to work on a large dataset without loading it all into memory. The fact that you can't get an immediate length should be considered a tradeoff.

1
votes

Because apparently the duplication wasn't noticed at the time, I'll post an extract from my answer to the duplicate here as well:

There is a way to perform meaningfully faster than sum(1 for i in it) when the iterable may be long (and not meaningfully slower when the iterable is short), while maintaining fixed memory overhead behavior (unlike len(list(it))) to avoid swap thrashing and reallocation overhead for larger inputs.

# On Python 2 only, get zip that lazily generates results instead of returning list
from future_builtins import zip

from collections import deque
from itertools import count

def ilen(it):
    # Make a stateful counting iterator
    cnt = count()
    # zip it with the input iterator, then drain until input exhausted at C level
    deque(zip(it, cnt), 0) # cnt must be second zip arg to avoid advancing too far
    # Since count 0 based, the next value is the count
    return next(cnt)

Like len(list(it)), ilen(it) performs the loop in C code on CPython (deque, count and zip are all implemented in C); avoiding byte code execution per loop is usually the key to performance in CPython.

Rather than repeat all the performance numbers here, I'll just point you to my answer with the full perf details.

0
votes

We'll, if you think about it, how do you propose you find the number of lines in a file without reading the whole file for newlines? Sure, you can find the size of the file, and if you can gurantee that the length of a line is x, you can get the number of lines in a file. But unless you have some kind of constraint, I fail to see how this can work at all. Also, since iterables can be infinitely long...

0
votes

For filtering, this variation can be used:

sum(is_good(item) for item in iterable)

which can be naturally read as "count good items" and is shorter and simpler (although perhaps less idiomatic) than:

sum(1 for item in iterable if is_good(item)))

Note: The fact that True evaluates to 1 in numeric contexts is specified in the docs (https://docs.python.org/3.6/library/stdtypes.html#boolean-values), so this coercion is not a hack (as opposed to some other languages like C/C++).

-1
votes

I did a test between the two common procedures in some code of mine, which finds how many graphs on n vertices there are, to see which method of counting elements of a generated list goes faster. Sage has a generator graphs(n) which generates all graphs on n vertices. I created two functions which obtain the length of a list obtained by an iterator in two different ways and timed each of them (averaging over 100 test runs) using the time.time() function. The functions were as follows:

def test_code_list(n):
    l = graphs(n)
    return len(list(l))

and

def test_code_sum(n):
    S = sum(1 for _ in graphs(n))
    return S

Now I time each method

import time

t0 = time.time()
for i in range(100):
    test_code_list(5)
t1 = time.time()

avg_time = (t1-t0)/10

print 'average list method time = %s' % avg_time


t0 = time.time()
for i in range(100):
    test_code_sum(5)
t1 = time.time()

avg_time = (t1-t0)/100

print "average sum method time = %s" % avg_time

average list method time = 0.0391882109642

average sum method time = 0.0418473792076

So computing the number of graphs on n=5 vertices this way, the list method is slightly faster (although 100 test runs isn't a great sample size). But when I increased the length of the list being computed by trying graphs on n=7 vertices (i.e. changing graphs(5) to graphs(7)), the result was this:

average list method time = 4.14753051996

average sum method time = 3.96504004002

In this case the sum method was slightly faster. All in all, the two methods are approximately the same speed but the difference MIGHT depend on the length of your list (it might also just be that I only averaged over 100 test runs, which isn't very high -- would have taken forever otherwise).