In Spark, why does a narrow dependency strictly not require a shuffle over the network?

Question

I was reading about narrow vs. wide dependencies of an RDD partitioned across multiple partitions.

My question: I do not understand why RDDs built with narrow dependencies do not require a shuffle over the network. Or is it that a shuffle DOES happen, but only a small number of times?

Please refer to the diagram below -

Let's say a child RDD is created with a narrow dependency from a parent RDD, as marked by the red rectangle below. The parent RDD has 3 partitions, say (P1, P2, P3), and the data in each partition gets mapped into 3 other partitions, say (P1, P4, P5), respectively.

Since the data in parent partition P1 is mapped to itself, there is no shuffle over the network. But the data from parent partitions P2 and P3 is mapped to child partitions P4 and P5, which are different partitions, so naturally the data has to pass through the network to be placed in P4 and P5. So why do we say that there is no shuffle over the network?

See the green box, which is an even more complex case. The only case I can visualize where there is no shuffle over the network is when the parent RDD's partitions are mapped to themselves.

I am sure my reasoning is incorrect. Could someone provide an explanation? Thanks.

My understanding was flawed. When we apply map() to an RDD, the partitions do not change. At most, the partitioner (the hashing) is lost. So a parent RDD spread across 3 partitions (P1, P2, P3) results in a child RDD spread across exactly (P1, P2, P3), with the data in each partition mapped one-to-one by map(function). Thus there is no shuffle in the red box above. — cph_sto
In the green box, both parent RDDs have the same partitioner (they are co-partitioned), so data with the same keys is on the same partition and no shuffling is involved. Hence a join() operation results in a narrow dependency ONLY when both RDDs are partitioned with the same partitioner; otherwise join() results in a wide dependency, which means shuffling data across the network. — cph_sto
Note: in the green box above there are not 6 partitions but 3, because the input RDDs are co-partitioned, i.e., partitioned with the same partitioner, so elements with the same keys end up at the same partition index. If RDD1 is placed on (P1, P2, P3), then RDD2 is placed on (P1, P2, P3) as well. Just because there are 6 boxes doesn't imply 6 partitions ;) — cph_sto
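The point about map() in the comments above can be illustrated with a toy model. This is plain Python, not the Spark API: an RDD is assumed to be a list of partitions, and `map_partitions` is a hypothetical helper that applies a function partition by partition, so child partition i is computed only from parent partition i.

```python
def map_partitions(partitions, fn):
    """Toy stand-in for map(): apply fn inside each partition.

    Child partition i is built only from parent partition i,
    so no element ever moves to a different partition.
    """
    return [[fn(x) for x in part] for part in partitions]


# Hypothetical parent "RDD" with 3 partitions (P1, P2, P3).
parent = [[1, 2], [3, 4], [5, 6]]

# The child has the same 3 partitions, each mapped one-to-one.
child = map_partitions(parent, lambda x: x * 10)

print(len(parent), len(child))
print(child)
```

In this sketch the number of partitions and the placement of every element stay fixed; only the values change, which is why nothing has to be moved between partitions.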

Alper t. Turker · Accepted Answer · 2018-05-03T18:06:39

A narrow dependency doesn't imply that there is no network traffic.

The distinction between narrow and wide is more subtle:

With a wide dependency, each child partition depends on each partition of its parents. It is a many-to-many relationship.
With a narrow dependency, each child partition depends on at most one partition from each parent. It can be either a one-to-one or a many-to-one relationship.
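To make the one-to-one / many-to-one / many-to-many distinction concrete, here is a small plain-Python sketch (not Spark code). It assumes a dependency is described by a hypothetical dict mapping each child partition index to the set of parent partition indices it reads, and it assumes the fan-out test as the criterion: a dependency counts as narrow when every parent partition is read by at most one child partition.

```python
from collections import defaultdict


def parent_fanout(deps):
    """Count how many child partitions read each parent partition.

    deps: dict of child partition index -> set of parent partition indices.
    """
    fanout = defaultdict(int)
    for parents in deps.values():
        for p in parents:
            fanout[p] += 1
    return dict(fanout)


def classify(deps):
    """Return 'narrow' if no parent partition feeds more than one child."""
    if all(count <= 1 for count in parent_fanout(deps).values()):
        return "narrow"
    return "wide"


# Illustrative dependency shapes (the names are assumptions, not Spark output).
one_to_one = {0: {0}, 1: {1}, 2: {2}}            # map-like
many_to_one = {0: {0, 1}, 1: {2}}                # several parents merged into one child
many_to_many = {0: {0, 1, 2}, 1: {0, 1, 2}}      # every child reads every parent

for name, deps in [("one_to_one", one_to_one),
                   ("many_to_one", many_to_one),
                   ("many_to_many", many_to_many)]:
    print(name, classify(deps))
```

The classification only looks at which partitions feed which; it says nothing about where those partitions physically live, which is the separate question of network traffic.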

Whether network traffic is required depends on factors other than the transformation alone. For example, co-partitioned RDDs can be joined without network traffic if they were shuffled during the same action (in which case they are both co-partitioned and co-located), or with network traffic otherwise.
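The co-partitioned join can be sketched with a toy model in plain Python (not the Spark API). It assumes integer keys and a hypothetical partitioner `key % 3`; `partition_by_key` and `join_partition` are illustrative helpers. Because both inputs use the same partitioner, matching keys share a partition index, so joining partition i with partition i gives the same result as a global join. Co-location, i.e. whether the two partitions with the same index sit on the same machine, is not modeled here and is what decides whether network traffic occurs.

```python
NUM_PARTITIONS = 3  # assumed partition count for this toy example


def partition_by_key(pairs, n=NUM_PARTITIONS):
    """Place each (key, value) pair in partition key % n (toy partitioner)."""
    parts = [[] for _ in range(n)]
    for key, value in pairs:
        parts[key % n].append((key, value))
    return parts


def join_partition(left, right):
    """Inner-join two lists of (key, value) pairs on key."""
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]


left = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
right = [(1, "x"), (3, "y"), (4, "z"), (5, "w")]

# Both sides use the same partitioner, so they are co-partitioned.
left_parts = partition_by_key(left)
right_parts = partition_by_key(right)

# Join index by index: child partition i reads only partition i of each parent.
joined_parts = [join_partition(l, r) for l, r in zip(left_parts, right_parts)]
joined = sorted(pair for part in joined_parts for pair in part)

# Reference result: a join over the unpartitioned data.
global_join = sorted(join_partition(left, right))

print(joined == global_join)
```

If the two sides used different partitioners, the index-by-index join would miss matches, and the data would first have to be redistributed by key, which is the shuffle.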
