
I have an ordered RDD of type ((id, ts), some value). This was partitioned using a custom partitioner on the id field only.

math.abs(id.hashCode % numPartitions)
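For reference, that partition-assignment logic can be sketched as a standalone plain-Scala function (in a real job it would live inside a custom org.apache.spark.Partitioner's getPartition; the IdPartitioning name here is just illustrative):

```scala
// Sketch of the id-only partition assignment: the ts component of the
// (id, ts) key is ignored, so every record with a given id maps to the
// same partition. Plain Scala, so it runs outside Spark.
object IdPartitioning {
  def getPartition(id: Long, numPartitions: Int): Int =
    math.abs(id.hashCode % numPartitions)
}
```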

Now, if I run the following two operations on this partitioned RDD, will they involve shuffling and re-partitioning of the dataset?

val partitionedRDD: RDD[((Long, Long), String)] = <Some Function> // ((id, ts), value)
val flatRDD = partitionedRDD.map { case ((id, ts), value) => (id, (ts, value)) }

What I want to know is whether flatRDD.groupByKey() and flatRDD.reduceByKey() will keep the same partitioning as partitionedRDD, or whether Spark will shuffle the dataset again and create new partitions.

Thanks, Devi


1 Answer


Yes, performing groupByKey or reduceByKey on flatRDD will cause another shuffle: map does not preserve the partitioner (Spark cannot know your function left the keys intact), so flatRDD.partitioner is None and any subsequent *ByKey operation has to repartition the data.

However, since you know flatRDD is still physically partitioned by id (map transforms records in place and never moves them between partitions), you can safely assume that all records with the same id reside within a single partition. Therefore, if you want to group by id, you can use mapPartitions (with preservesPartitioning = true) and perform the operation on each partition separately, preventing Spark from shuffling your data:

flatRDD.mapPartitions({ it =>
  it.toList
    .groupBy(_._1)     // group this partition's records by id
    .mapValues(_.size) // example reduction: count records per id
    .iterator
}, preservesPartitioning = true)
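To see why this per-partition grouping is globally correct, here is a plain-Scala simulation (the partitions and records are made-up sample data): when no id spans two partitions, grouping each partition independently produces exactly the same counts as grouping the whole dataset at once.

```scala
// Two simulated partitions of (id, (ts, value)) records, laid out the way
// an id-based partitioner would: no id appears in more than one partition.
val partitions: List[List[(Long, (Long, String))]] = List(
  List((1L, (10L, "a")), (1L, (11L, "b")), (3L, (12L, "c"))), // partition 0
  List((2L, (13L, "d")), (2L, (14L, "e")))                    // partition 1
)

// Per-partition grouping, mirroring the mapPartitions body above.
val perPartition: Map[Long, Int] =
  partitions.flatMap(_.groupBy(_._1).map { case (id, recs) => id -> recs.size }).toMap

// Grouping the whole dataset at once, as a full groupByKey + count would.
val global: Map[Long, Int] =
  partitions.flatten.groupBy(_._1).map { case (id, recs) => id -> recs.size }

// perPartition == global: no cross-partition work was needed.
```

If some id did cross a partition boundary, the two maps would disagree, which is exactly the situation the shuffle in a regular groupByKey exists to fix.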

This will not cause an extra shuffle, which you can confirm in the Spark UI's DAG visualization: the job runs as a single stage.
