I would like some clarification about the DAG behaviour, and about how exactly the following job is handled:
import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(List(1 to 10).flatMap(x => x).zipWithIndex, 3)
  .partitionBy(new HashPartitioner(4))
val rdd1 = sc.parallelize(List(1 to 10).flatMap(x => x).zipWithIndex, 2)
  .partitionBy(new HashPartitioner(3))
val rdd2 = rdd.join(rdd1)
rdd2.collect()
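For reference, printing the partitioners shows what the join result inherits; given the setup above, I would expect this quick check to print the following:

println(rdd.partitioner.map(_.numPartitions))   // Some(4)
println(rdd1.partitioner.map(_.numPartitions))  // Some(3)
println(rdd2.partitioner.map(_.numPartitions))  // Some(4), i.e. rdd's partitioner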
This is the corresponding rdd2.toDebugString output:
(4) MapPartitionsRDD[6] at join at IntegrationStatusJob.scala:92 []
| MapPartitionsRDD[5] at join at IntegrationStatusJob.scala:92 []
| CoGroupedRDD[4] at join at IntegrationStatusJob.scala:92 []
| ShuffledRDD[1] at partitionBy at IntegrationStatusJob.scala:90 []
+-(3) ParallelCollectionRDD[0] at parallelize at IntegrationStatusJob.scala:90 []
+-(3) ShuffledRDD[3] at partitionBy at IntegrationStatusJob.scala:91 []
+-(2) ParallelCollectionRDD[2] at parallelize at IntegrationStatusJob.scala:91 []
Looking at the toDebugString output and at the Spark UI, if I understood correctly: in order to perform the join, the DAG looks at which partitioner should be used, and because both RDDs are hash-partitioned, it chooses the partitioner with the greater number of partitions, i.e. rdd's partitioner with 4 partitions.
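As far as I can tell, that choice matches the gist of Partitioner.defaultPartitioner. A simplified sketch of the logic as I understand it (not the actual Spark source, and ignoring the spark.default.parallelism fallback for brevity):

import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// Among the inputs that already have a partitioner, take the one with the
// most partitions; otherwise build a fresh HashPartitioner sized by the
// largest input.
def pickPartitioner(rdds: Seq[RDD[_]]): Partitioner = {
  val withPartitioner = rdds.filter(_.partitioner.isDefined)
  if (withPartitioner.nonEmpty)
    withPartitioner.maxBy(_.partitions.length).partitioner.get
  else
    new HashPartitioner(rdds.map(_.partitions.length).max)
}

In my example this would return rdd's HashPartitioner(4).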
Now, from the Spark UI it seems that rdd's partitionBy and the join are performed in the same stage. Under these conditions, will the shuffle needed to perform the join be done from just one side? By one side, I mean that only rdd1 will be shuffled, not both.
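One way I could try to verify this programmatically (a sketch; it assumes the CoGroupedRDD sits two parents above rdd2, as in the debug string above):

import org.apache.spark.{OneToOneDependency, ShuffleDependency}

// rdd2 (MapPartitionsRDD[6]) -> MapPartitionsRDD[5] -> CoGroupedRDD[4]
val cogrouped = rdd2.dependencies.head.rdd.dependencies.head.rdd

cogrouped.dependencies.foreach {
  case _: OneToOneDependency[_]      => println("narrow side: already partitioned, no shuffle")
  case _: ShuffleDependency[_, _, _] => println("shuffled side: re-partitioned for the join")
  case other                         => println(s"other dependency: $other")
}

If only one of the two dependencies prints as a shuffle dependency, that would confirm that just rdd1 is shuffled.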
Is my assumption correct?
