3
votes

I'm using Spark's Python API.
I have a big text file which I load with rdd = sc.textFile("file.txt").
After that, I want to perform a mapPartitions transformation on the rdd.
However, within each partition I can only access the file's lines through a Python iterator.
This is not how I'd prefer to work with the data, and it hurts my app's performance.
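
Roughly, this is what I have now (a minimal sketch; the SparkContext setup and the per-line work are just placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="example")      # assuming no existing SparkContext
rdd = sc.textFile("file.txt")             # each RDD element is one line

def process_partition(lines):
    # `lines` is a plain Python iterator over this partition's lines
    for line in lines:
        yield line.split()                # placeholder per-line work

result = rdd.mapPartitions(process_partition).collect()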

Is there some other way to access the text file within each partition?
For example: getting it like a real text file, as one string where the lines are separated by \n?

1
For example, I want to create a NumPy array for each partition, where the number of rows in the array equals the number of lines in that partition. - member555

1 Answer

2
votes

For starters, you can use the glom method, which coalesces all elements within each partition into a list:

rdd = sc.parallelize(range(50), 5).map(str)
glomed = rdd.glom()
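
For example, with the RDD above, the first partition should come back as a plain Python list of ten strings:

glomed.first()
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']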

This means that inside mapPartitions each partition's iterator will yield only a single element: the list of that partition's lines. Next, you can simply join the lines:

def do_something(partition):          # renamed to avoid shadowing the built-in `iter`
    s = "\n".join(next(partition))    # for Python 2 use partition.next()
    # ... do something with s
    return ...

glomed.mapPartitions(do_something)

An even simpler approach is to omit glom and simply join the lines directly, wrapping the result in a list since the function passed to mapPartitions must return an iterable:

rdd.mapPartitions(lambda iter: ["\n".join(iter)]).first()
"0\n1\n2\n3\n4\n5\n6\n7\n8\n9'

Note:

In general there should be no need for this. Most Python modules work just fine with iterators, and there is no inherent performance penalty in using them. Moreover, for text files the content of each partition depends almost entirely on cluster settings (block and split sizes), not on the data itself, so treating a partition as a single document is arguably not particularly useful.
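
That said, if the goal from the comment is a NumPy array per partition with one row per line, the iterator can be consumed directly; here is a rough sketch, assuming each line holds the same number of whitespace-separated numeric values (the function name and that assumption are mine):

import numpy as np

def to_array(lines):
    # Build one 2-D array per partition: one row per line,
    # one column per whitespace-separated value.
    rows = [[float(x) for x in line.split()] for line in lines]
    yield np.array(rows) if rows else np.empty((0, 0))

arrays = rdd.mapPartitions(to_array)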