If a single RDD as output is OK (I don't really see a reason to create many RDDs with only one record each), you can reduce your Array of RDDs into a single RDD and then do a groupByKey, flattening the grouped sets afterwards:
arr.reduce( _ ++ _ )
   .groupByKey
   .mapValues(_.flatMap(identity))
Example:
scala> val x = sc.parallelize( List( ("A", Set(1,2)) ) )
scala> val x2 = sc.parallelize( List( ("A", Set(3,4)) ) )
scala> val arr = Array(x,x2)
arr: Array[org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Set[Int])]] = Array(ParallelCollectionRDD[0] at parallelize at <console>:27, ParallelCollectionRDD[1] at parallelize at <console>:27)
scala> arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity)).foreach(println)
(A,List(1, 2, 3, 4))
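As a side note, sc.union(arr) should give the same combined RDD and is usually preferable when the array is large, since it builds a single union instead of one nested UnionRDD per element of the array (a sketch, same toy data as above):

// assumes sc and arr from the example above
sc.union(arr)
  .groupByKey
  .mapValues(_.flatMap(identity))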
@Edit: I think this is a really bad idea and would advise you to rethink it, but you can get the output you want by collecting all the keys from the RDD above and filtering it once per key:
val sub = arr.reduce( _ ++ _ ).groupByKey.mapValues(_.flatMap(identity))
val keys = sub.map(_._1).collect()
val result = for(k <- keys) yield sub.filter(_._1 == k)
result: Array[org.apache.spark.rdd.RDD[(String, Iterable[Int])]]
Each RDD will contain a single tuple; I don't think this is very useful or performant.
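To illustrate, with the toy data from the example above there is only one key, so result holds a single one-element RDD; printing each sub-RDD should give something like the following (the exact Iterable representation may vary):

scala> result.foreach(_.foreach(println))
(A,List(1, 2, 3, 4))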