8
votes

Suppose I have a number of lists of pairs (int, str), not necessarily of the same length. The only constraint here is that the lists are sorted in ascending order by their integer parts:

a = [(1, 'a'), (4, 'a'), (6, 'b'), (7, 'c'), (12, 'a')]
b = [(5, 'd'), (10, 'c'), (11,'e')]
c = [(0, 'b'), (3, 'd')]

What I would like to do is to emit the string elements in the order in which their corresponding integer elements occur i.e. in this case:

(0, 'b'), (1, 'a'), (3, 'd'), (4, 'a'), ... 

I am wondering if there is an obvious (nice + pythonic) way to do this just using iterators of a, b and c? I've looked at itertools but can't immediately see how to use the functionality in this case. The lists a, b, c might be very large so I'd like to do this without reading them into memory and then sorting ...

3
There is no way to do it without reading them all. If you don't read them all, you can't know whether the one you didn't read should actually have come out first. Also, if they're lists, they're already all in memory anyway. - BrenBarn

3 Answers

16
votes

Since the lists are already sorted, you can use heapq.merge:

>>> import heapq
>>> a = [(1, 'a'), (4, 'a'), (6, 'b'), (7, 'c'), (12, 'a')]
>>> b = [(5, 'd'), (10, 'c'), (11,'e')]
>>> c = [(0, 'b'), (3, 'd')]
>>> for i in heapq.merge(a, b, c):
...     i
...
(0, 'b')
(1, 'a')
(3, 'd')
(4, 'a')
(5, 'd')
(6, 'b')
(7, 'c')
(10, 'c')
(11, 'e')
(12, 'a')
>>>

This is also very efficient with large lists since it uses iterators internally. From the docs link given above:

Similar to sorted(itertools.chain(*iterables)) but returns an iterable, does not pull the data into memory all at once, and assumes that each of the input streams is already sorted (smallest to largest).

4
votes
my_iterator = iter(sorted(a+b+c))

is by far the most pythonic imho (although you can probably just leave it as a list and not wrap the extra iter

you could certainly speed it up if this was a bottleneck(which I doubt it is)

0
votes

heapq.merge is likely the best choice. FWIW more_itertools also offers a mergesort tool, similar to the accepted accepted answer:

import operator as op

import more_itertools

list(more_itertools.collate(a, b, c, key=op.itemgetter(0)))

Output

[(0, 'b'),
 (1, 'a'),
 (3, 'd'),
 (4, 'a'),
 (5, 'd'),
 (6, 'b'),
 (7, 'c'),
 (10, 'c'),
 (11, 'e'),
 (12, 'a')]

See more_itertools docs for more information.