Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?

Question

Will rdd1.join(rdd2) cause a shuffle to happen if rdd1 and rdd2 have the same partitioner?

Can you rewrite this question to be more clear? Just because RDDs have partitions on the same machines doesn't mean all keys are always on the same partition across both. What are you asking then? — Sean Owen
I've rewritten the question completely. I think it makes sense now, but I'm not sure it's what @zwb meant. I did not really understand the original. Feel free to revert my edit and update the question if necessary. — Daniel Darabos
Thanks. I come from China and my English is poor, so I can't express myself very clearly. What you have rewritten is what I meant. — zwb

Daniel Darabos · Accepted Answer · 2015-02-08T21:42:01

No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_ <: Product2[K, _]] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](rdd, part, serializer)
    }
  }
}

Note, however, that the absence of a shuffle does not mean that no data has to move between nodes. Two RDDs can have the same partitioner (be co-partitioned) yet have their corresponding partitions located on different nodes (not be co-located).

This situation is still better than doing a shuffle, but it's something to keep in mind. Co-location can improve performance, but is hard to guarantee.
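
If you want to check this on your own job, a rough way is to look at the dependencies of the CoGroupedRDD that backs the join. The following is only a minimal sketch, assuming Spark core is on the classpath and a local master is acceptable. The object name, the helper cogroupDependencies, and the sample data are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.{Dependency, HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.{CoGroupedRDD, RDD}

object CoPartitionedJoinCheck {
  // Hypothetical helper: walk down the lineage until the CoGroupedRDD that
  // backs the join is reached, and return that RDD's own dependencies.
  def cogroupDependencies(rdd: RDD[_]): Seq[Dependency[_]] = rdd match {
    case c: CoGroupedRDD[_] => c.dependencies
    case other              => other.dependencies.flatMap(d => cogroupDependencies(d.rdd))
  }

  // True if the cogroup step reads at least one parent through a shuffle.
  def joinShuffles(joined: RDD[_]): Boolean =
    cogroupDependencies(joined).exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("co-partitioned-join-check")
    val sc = new SparkContext(conf)
    try {
      val part = new HashPartitioner(4)

      // Both sides are partitioned up front with the same partitioner.
      val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c")).partitionBy(part)
      val right = sc.parallelize(Seq(1 -> "x", 2 -> "y", 4 -> "z")).partitionBy(part)

      // Same data as `right`, but without a partitioner.
      val unpartitioned = sc.parallelize(Seq(1 -> "x", 2 -> "y", 4 -> "z"))

      // Passing the partitioner explicitly avoids relying on the default choice.
      val coPartitioned = left.join(right, part)
      val mixed         = left.join(unpartitioned, part)

      // Expected: false, both parents match `part`, so both are one-to-one.
      println(s"co-partitioned join shuffles: ${joinShuffles(coPartitioned)}")
      // Expected: true, the unpartitioned side needs a shuffle dependency.
      println(s"mixed join shuffles:          ${joinShuffles(mixed)}")
    } finally {
      sc.stop()
    }
  }
}
```

Keep in mind that the partitionBy calls themselves shuffle the data once. The point is only that the join adds no further shuffle on top of that. Printing coPartitioned.toDebugString is another way to look at the lineage, and none of this says anything about whether the partitions are co-located.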
