2
votes

How can I cross combine (is this the correct way to describe it?) two RDDs?

input:

rdd1 = [a, b]
rdd2 = [c, d]

output:

rdd3 = [(a, c), (a, d), (b, c), (b, d)]

I tried rdd3 = rdd1.flatMap(lambda x: rdd2.map(lambda y: (x, y))), but it complains: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation." I guess that means you cannot nest actions the way you would in a list comprehension, and one statement can only do one action.


2 Answers

3
votes

As you have noticed, you can't perform a transformation inside another transformation (note that flatMap and map are transformations rather than actions, since they return RDDs). Thankfully, what you're trying to accomplish is directly supported by another transformation in the Spark API, namely cartesian (see http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD ).

So you would want to do rdd1.cartesian(rdd2).

1
votes

You can use the cartesian transformation. Here's an example from the documentation:

>>> rdd = sc.parallelize([1,2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]

In your case, you would do rdd3 = rdd1.cartesian(rdd2).
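To check which pairs to expect without starting a Spark context, the same cross product can be sketched on plain Python lists. This is only a local analogy: itertools.product is from the Python standard library, not Spark, and the lists left and right are illustrative stand-ins for the contents of rdd1 and rdd2.

```python
from itertools import product

# Plain lists standing in for the contents of rdd1 and rdd2 (illustrative values).
left = ["a", "b"]
right = ["c", "d"]

# Pair every element of `left` with every element of `right`.
pairs = list(product(left, right))
print(pairs)  # [('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')]
```

Note that the documentation example sorts the collected result before printing it, so it is probably safest not to rely on the order of the pairs coming back from an RDD.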