
I have a document of words, with one word on each line. I want to turn each of these words into a list of its constituent characters, which I do with list(x), where x is the word.

This is what I am doing, but I want a way to parallelize this:

split = []
with open('wordprob.txt','rt') as lines:
    for line in lines:
        split.append(list(line))

I am using this approach so that I do not have to load the entire file (over 3 GB) into memory. When I parallelize by loading the file first, my memory usage goes over 16 GB.

How can I parallelize it without loading the file into memory, just like in the loop above?

Thanks!

EDIT: It was pointed out below that the list will take up a lot of memory. Instead, how would I store each list of characters (originally from a single word) as a space-delimited string on its own line of a new file? Again, I'm looking for a parallelized solution.

But you are storing the entire list in memory, and the list will be much larger than the text file. Maybe tell us what you are trying to do? – Ray Toal
@RayToal Okay, makes sense. Can we store each set of characters as space-delimited values in a new file, with each word (now characters) on its own line? – Keshav M

1 Answer


If I understand the problem correctly, you have a file such as

sushi
banana
sujuk
strawberry
tomato
pho
ramen
manaqish

and you want to produce a new file like this:

s u s h i
b a n a n a
s u j u k
s t r a w b e r r y
t o m a t o
p h o
r a m e n
m a n a q i s h

then you can write a simple stdin-to-stdout script, something like this:

import sys

for line in sys.stdin:
    word = line.strip()  # drop the trailing newline so it is not treated as a character
    sys.stdout.write(' '.join(word) + '\n')
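
If you save that as, say, mapper.py (the file names here are just placeholders), you can run it over the whole file with python mapper.py < wordprob.txt > wordprob_split.txt.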

If all the data goes to the same file, then even if you parallelize, each of your threads or processes will be competing to write to the same output file.
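
That said, if you do want parallelism in plain Python while keeping memory bounded, one option is to have worker processes do the splitting while the parent process alone reads the file lazily and writes the output, so the workers never touch the output file. A minimal sketch using multiprocessing.Pool.imap (the file names are placeholders, and for per-line work this cheap the process overhead may well outweigh any speedup):

import multiprocessing as mp

def split_word(line):
    # turn one word into a space-delimited string of its characters
    return ' '.join(line.strip())

if __name__ == '__main__':
    with mp.Pool() as pool, \
         open('wordprob.txt', 'rt') as words, \
         open('wordprob_split.txt', 'wt') as out:
        # imap pulls lines from the file as it goes and yields results in
        # input order, so the file is never materialized as one big list
        for result in pool.imap(split_word, words, chunksize=10000):
            out.write(result + '\n')

Since only the parent writes to the output file, the single-writer problem above goes away.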

If you want to scale the parallelism further while sticking with Python, you can use Hadoop Streaming. Your job will be a mapper-only job; in fact, the mapper is the short script above. But I'm not sure this buys you much unless your data set is ridiculously large. The transformation is pretty simple, but feel free to profile the job to see whether you get much benefit.
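
For reference, a mapper-only streaming job is typically launched with something along the lines of hadoop jar hadoop-streaming.jar -input /path/to/wordprob.txt -output /path/to/output -mapper mapper.py -file mapper.py, though the jar location and exact options depend on your Hadoop version, and the paths here are only placeholders.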

I don't think 3 GB is very much, but this can be a fun exercise in Hadoop (or whatever the kids are using these days).