I have an RDD of (key, value) pairs that I transformed into an RDD of (key, List(value1, value2, value3)) as follows:
val rddInit = sc.parallelize(List((1, 2), (1, 3), (2, 5), (2, 7), (3, 10)))
val rddReduced = rddInit.groupByKey.mapValues(_.toList)
rddReduced.take(3).foreach(println)
This code gives me the following RDD:
(1,List(2, 3))
(2,List(5, 7))
(3,List(10))
But now I would like to reconstruct rddInit from the RDD I just computed (rddReduced).
My first guess is to perform some kind of cross product between the key and each element of the List, like this:
import scala.collection.mutable.ListBuffer

rddReduced.map {
  case (x, y) =>
    // Build one (key, element) pair per element of the list
    val myList: ListBuffer[(Int, Int)] = ListBuffer()
    for (element <- y) {
      myList += new Pair(x, element)
    }
    myList.toList
}.flatMap(x => x).take(5).foreach(println)
With this code, I get the initial RDD as a result. But I don't think using a ListBuffer inside a Spark job is good practice. Is there another way to solve this problem?
map followed by flatMap(identity) => flatMap. Use element.map(Pair(...)) - using ListBuffer makes the code too complicated. Make Pair a case class. - Reactormonk
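A minimal sketch of what that comment suggests, assuming the rddReduced defined above: a single flatMap over the grouped pairs, mapping each list element back to a (key, value) tuple, with no mutable buffer:

rddReduced.flatMap { case (key, values) =>
  // emit one (key, value) pair per element of the grouped list
  values.map(value => (key, value))
}.take(5).foreach(println)

This yields the same (key, value) pairs as the ListBuffer version; the one flatMap replaces the map-then-flatMap(identity) chain.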