I am trying to merge multiple RDDs of strings into a single RDD of rows, in a specific column order. I tried building a Map[String, RDD[Seq[String]]] (where each Seq contains only one element) and then merging them into an RDD[Row], but it doesn't seem to work (the content of the RDD[Seq[String]] is lost). Does anyone have any ideas?
val t1: StructType
val mapFields: Map[String, RDD[Seq[String]]]
var ordRDD: RDD[Seq[String]] = context.emptyRDD
t1.foreach(field => ordRDD = ordRDD ++ mapFields(field.name))
val rdd = ordRDD.map(line => Row.fromSeq(line))
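The reason this loses the column structure is that `++` on RDDs is a union: it appends the elements of one RDD after the other instead of aligning them element-wise. A minimal sketch of the difference, using local Seqs as stand-ins for the RDDs (the `++` and `zip` semantics are the same):

```scala
object UnionVsZip {
  def main(args: Array[String]): Unit = {
    // Two "columns", each value wrapped in a one-element Seq as in the question
    val col1 = Seq(Seq("a1"), Seq("a2"))
    val col2 = Seq(Seq("b1"), Seq("b2"))

    // ++ (union) stacks the elements: four one-element rows, not two two-element rows
    val unioned = col1 ++ col2
    println(unioned) // List(List(a1), List(a2), List(b1), List(b2))

    // zip aligns them element-wise: two two-element rows, one per output Row
    val zipped = col1.zip(col2).map { case (l, r) => l ++ r }
    println(zipped) // List(List(a1, b1), List(a2, b2))
  }
}
```

So the union produces one long column of single values, and `Row.fromSeq` then sees one-element rows instead of one row per record.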
EDIT :
Using the zip function led to a Spark exception, because my RDDs didn't have the same number of elements in each partition. I don't know how to make sure they all have the same number of elements per partition, so I just zipped each of them with its index and then joined them in the right order using a ListMap. Maybe there is a trick with the mapPartitions function, but I don't know the Spark API well enough yet.
val mapFields: Map[String, RDD[String]]
var ord: ListMap[String, RDD[String]] = ListMap()
t1.foreach(field => ord = ord ++ Map(field.name -> mapFields(field.name)))
// Note : zip = SparkException: Can only zip RDDs with same number of elements in each partition
//val rdd: RDD[Row] = ord.toSeq.map(_._2.map(s => Seq(s))).reduceLeft((rdd1, rdd2) => rdd1.zip(rdd2).map{ case (l1, l2) => l1 ++ l2 }).map(Row.fromSeq)
val zipRdd = ord.toSeq.map(_._2.map(s => Seq(s)).zipWithIndex().map{ case (d, i) => (i, d) })
val concatRdd = zipRdd.reduceLeft((rdd1, rdd2) => rdd1.join(rdd2).map{ case (i, (l1, l2)) => (i, l1 ++ l2)})
val rowRdd: RDD[Row] = concatRdd.map{ case (i, d) => Row.fromSeq(d) }
val df1 = spark.createDataFrame(rowRdd, t1)
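The zip-with-index-and-join logic above can be checked locally without Spark; this sketch mimics it with plain Scala collections (the names `IndexJoinMerge` and `mergeColumns` are hypothetical, not Spark API), pairing each value with its index and joining columns on that shared index:

```scala
object IndexJoinMerge {
  // Merge several equally-sized "columns" by pairing each value with its index
  // and joining on that index, mirroring zipWithIndex + join on RDDs.
  def mergeColumns(columns: Seq[Seq[String]]): Seq[Seq[String]] = {
    // Each column becomes a map index -> one-element Seq, like (i, Seq(s)) pairs
    val indexed: Seq[Map[Long, Seq[String]]] =
      columns.map(_.zipWithIndex.map { case (v, i) => (i.toLong, Seq(v)) }.toMap)
    indexed
      .reduceLeft { (m1, m2) =>
        m1.map { case (i, l) => (i, l ++ m2(i)) } // join on the shared index
      }
      .toSeq
      .sortBy(_._1) // restore row order (the RDD join does not guarantee it either)
      .map(_._2)
  }

  def main(args: Array[String]): Unit = {
    val merged = mergeColumns(Seq(Seq("a1", "a2"), Seq("b1", "b2"), Seq("c1", "c2")))
    println(merged) // List(List(a1, b1, c1), List(a2, b2, c2))
  }
}
```

Note that, unlike `zip`, the join-by-index approach does not preserve ordering on its own, which is why the Spark version carries the index through until `Row.fromSeq`.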
Each RDD will become a column. The RDDs are supposed to have the same size, so I don't think it's necessary to take that situation into account. - belgacea