1
votes

I was reading about Narrow Vs Wide dependencies of an RDD partitioned across multiple partititon.

My Question: I do not understand that why RDDs built with Narrow Dependencies do not require a schuffle over the network? OR is it that shuffle DOES happens, but only a few number of times?

Please refer to the diagram below - enter image description here

Let's say a child RDD is created with Narrow Dependency from a parent RDD, as marked in the red rectangle below. Now, parent RDD had 3 partitions, let's say (P1,P2,P3) and data in each respective partition got mapped got mapped into 3 other partitions, let's say (P1,P4,P5) respectively.

Since, the data in parent RDD partition P1 got mapped to itself, so there is no shuffle over the network. But since the data from parent RDD partition P2 & P3 got mapped to child RDD partitions P4 & P5, which are different partitions, so naturally the data has to pass through the network to have the corresponding values placed in P4 & P5. Thus, why do we say that there is no shuffle over the network?

See the box in green, this is even more complex case. Only case which I could visualize where there is no shuffle over the network should be when parent RDD partitions get mapped to itself.

I am sure my reasoning is incorrect. Could someone provide some explanation? Thanks

3
My understanding was flawed. What happens is that when we apply a map() function on a RDD, the partitions do not change. At maximum, partitioner/hashing will be destroyed. So, parent RDD across 3 partitions (P1,P2,P3) will result in child RDD spread across exactly (P1,P2,P3) respectively, with each partition data being mapped one to one using map(function). Thus, there will be no shuffle in red box above.cph_sto
In green box, since, both the parent RDDs have the same partitioner (they are co-partitioned), so data with same keys will be on the same partition, thus no shuffling involved. Hence, ONLY that join() opearation will result in narrow dependency where both RDDs are partitioned with the same partitioner, otherwise join() operation will result in wide dependency, which means shuffling of data across network.cph_sto
Note: In the green box above. There are not 6 partitions, but 3, because the input RDDs are co-partitioned, i.e; partitioned with the same partitioner, resulting in elements with same keys ending on the same partition index. If RDD1 is placed on (P1,P2,P3), then RDD2 is also placed similarly on (P1,P2,P3). Just because there are 6 boxes doesn't imply 6 partitions ;)cph_sto

3 Answers

3
votes

Narrow dependency doesn't imply that there is no network traffic.

The distinction between narrow and wide is more subtle:

  • With wide dependency each child partition depends on each partition of its parents. It is many-to-many relationship.
  • With narrow dependency each child partition depends on at most one partition from each parent. It can be either one-to-one or many-to-one relationship.

If network traffic is required depends on other factors than transformation alone. For example co-partitioned RDDs can be joined without network traffic if shuffle happened during the same action (in this case there is both co-partitioning and co-location) or with network traffic otherwise.

1
votes

From the link you provided:

A typical execution sequence is as follows ... RDD is created originally from external data sources (e.g. HDFS, Local file ... etc) RDD undergoes a sequence of TRANSFORMATION (e.g. map, flatMap, filter, groupBy, join), each provide a different RDD that feed into the next transformation. Finally the last step is an ACTION (e.g. count, collect, save, take), which convert the last RDD into an output to external data sources The above sequence of processing is called a lineage (outcome of the topological sort of the DAG)

Now think about how the data is processed as it makes it's way through the pipeline.

If there is a narrow dependency, then the child partition is only dependent on 1 parent partition. The parent partition's data can be processed on 1 machine and the child partition can then exist on the same machine. No shuffling of data is necessary.

If there is a wide dependency, then 1 child partition is dependent on many parent partitions. The parent partitions may exist on many machines, so the data must be shuffled across the network in order to complete the data processing.

1
votes

Let me give an analogy to illustrate partitions. If you had a set of documents and wanted to filter it to identify all the improperly filled ones, it is equivalent to doing a filter operation. In order to speed up the operation, you distribute the set of documents to three people so each one of them has a partition of the documents. Each person then sifts through the subset of documents given to them (in the input box) and puts the ones that are improperly filled into an output box. The operation performed by each individual is such that the contents of the output box depend only upon the input box provided to them; the input box of other people has no bearing on the output. Hence requires no network transfer.

Hope this explains.