
I have an RDD where every item is of the form

(key, [ele1, ele2, ele3, ..., elen])

Every item is a key-value pair whose value is a list of elements.

I want to unpack the list to create a new RDD in which each item contains a single element, as follows:

(key, ele1)
(key, ele2)
(key, ele3)
.
.
.
(key, elen)

How can I do this in PySpark?

I tried doing

RDD.flatmap(lambda line: line[1]) 

but that doesn't work.


1 Answer


Something like this? I used str elements for simplicity.

>>> rdd = sc.parallelize([('key', ['ele1', 'ele2'])])
>>> rdd.flatMap(lambda data: [(data[0], x) for x in data[1]]).collect()
[('key', 'ele1'), ('key', 'ele2')]
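
For reference, the attempt in the question fails for two reasons: the method is spelled flatMap (PySpark is case-sensitive, so flatmap raises an AttributeError), and the lambda returns line[1] alone, so the keys are dropped and you get a flat list of bare elements. An alternative worth knowing is flatMapValues, which flattens each value while passing the key through unchanged; here the identity function is enough (continuing the same session):

>>> rdd.flatMapValues(lambda v: v).collect()
[('key', 'ele1'), ('key', 'ele2')]

Both approaches produce the same pairs; flatMapValues also preserves the RDD's partitioner, since the keys are untouched, which can matter if you reduce or join by key afterwards.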