
I have multiple RDDs, each consisting of a list of users. How can I get the distinct union of each combination of these RDDs in a distributed way?

EDIT

OK, as mentioned in the comments, it's not about getting the distinct union of all RDDs and turning them into one RDD; it's about getting the distinct union of each combination of RDDs.

Let's say we've got three RDDs of the same type, RDD1, RDD2 and RDD3. I want to get the size of the distinct union of each combination of them, as follows:

sc.union([RDD1]).distinct().count()
sc.union([RDD2]).distinct().count()
sc.union([RDD3]).distinct().count()
sc.union([RDD1, RDD2]).distinct().count()
sc.union([RDD1, RDD3]).distinct().count()
sc.union([RDD2, RDD3]).distinct().count()
sc.union([RDD1, RDD2, RDD3]).distinct().count()

Since there's no RDD of RDDs in Spark, I can't make an RDD of all the combinations and map over each combination of RDDs to get the result.

Also, as the number of RDDs increases, the number of combinations grows as 2^n. How can I achieve this goal?
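
For illustration, the brute-force driver-side version of what I'm describing would enumerate the combinations with itertools and launch one job per combination, roughly like this (assuming the RDDs are kept in a plain Python list):

from itertools import combinations

rdds = [RDD1, RDD2, RDD3]
# every non-empty subset of the RDDs; the loop itself runs on the driver,
# only the union/distinct/count inside it is distributed
for r in range(1, len(rdds) + 1):
    for combo in combinations(rdds, r):
        size = sc.union(list(combo)).distinct().count()
        print(len(combo), size)

This gives the counts, but as a sequential loop of 2^n - 1 jobs rather than anything distributed over the combinations themselves.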

Best Regards.

Do you want to union all the RDDs and get one RDD with distinct users? – Notrius
Sorry for the late response; please see the EDIT part. It's about combinations of these RDDs. – sleepy whiskey

1 Answer


This is pretty simple if the RDDs are of the same type; just do:

rdd = sc.union([rdd1, rdd2, rdd3]).distinct()
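
To get the size, call count() on the result. If the same input RDDs are unioned repeatedly (one job per combination, as in the question), caching them first avoids recomputing their lineage each time. A minimal usage sketch, assuming rdd1, rdd2 and rdd3 are already defined:

# cache the inputs so repeated unions reuse them instead of recomputing
for r in (rdd1, rdd2, rdd3):
    r.cache()

combined = sc.union([rdd1, rdd2, rdd3]).distinct()
print(combined.count())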