I was reading about Narrow Vs Wide dependencies of an RDD partitioned across multiple partititon.
My Question: I do not understand that why RDDs built with Narrow Dependencies do not require a schuffle over the network? OR is it that shuffle DOES happens, but only a few number of times?
Please refer to the diagram below -
Let's say a child RDD is created with Narrow Dependency from a parent RDD, as marked in the red rectangle below. Now, parent RDD had 3 partitions, let's say (P1,P2,P3) and data in each respective partition got mapped got mapped into 3 other partitions, let's say (P1,P4,P5) respectively.
Since, the data in parent RDD partition P1 got mapped to itself, so there is no shuffle over the network. But since the data from parent RDD partition P2 & P3 got mapped to child RDD partitions P4 & P5, which are different partitions, so naturally the data has to pass through the network to have the corresponding values placed in P4 & P5. Thus, why do we say that there is no shuffle over the network?
See the box in green, this is even more complex case. Only case which I could visualize where there is no shuffle over the network should be when parent RDD partitions get mapped to itself.
I am sure my reasoning is incorrect. Could someone provide some explanation? Thanks