
I would greatly appreciate it if someone could answer these few Spark shuffle-related questions in simplified terms.

In Spark, when loading a dataset, we specify the number of partitions, which tells how many blocks the input data (RDD) should be divided into; based on the number of partitions, an equal number of tasks are launched (correct me if the assumption is wrong). For X cores in a worker node, the corresponding X tasks run at one time.
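For concreteness, here is a minimal sketch of that assumption in a spark-shell session (the file name, partition count, and core count are placeholders I am assuming, not anything from a real job):

    // Ask Spark to split the input into at least 8 partitions;
    // each partition is processed by one task in a stage.
    val rdd = sc.textFile("data.txt", 8)
    println(rdd.partitions.length)   // e.g. 8 -- one task per partition
    // On a worker with 4 cores (e.g. master "local[4]"),
    // 4 of those tasks run concurrently and the rest wait in the queue.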

Along similar lines, here are a few questions.

Since all byKey operations, along with coalesce, repartition, join, and cogroup, cause a data shuffle:

  1. Is data shuffle another name for the repartitioning operation?

  2. What happens to the initial partitions (the number of partitions declared) when a repartition happens?

  3. Can someone give an example of (explain) how data movement across the cluster happens? I have seen a couple of examples where random arrow movement of keys is shown (but I don't know how the movement is driven). For example, if we already have data in 10 partitions, does the repartitioning operation combine all the data first, and then send each key to a particular partition based on hash-code % number-of-partitions?


1 Answer


First of all, the HDFS data is divided into a number of partitions, not into blocks. These partitions reside in the workers' memory.

Q- Is data shuffle another name for the repartitioning operation?

A- No. Generally, repartitioning means increasing the number of existing partitions into which the data is divided. So whenever we increase the partitions, we are actually trying to "move" the data into the number of new partitions set in the code, not to "shuffle" it. Shuffling is when we move the data for a particular key into one partition.
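A hedged illustration of that distinction, with made-up data in a spark-shell session (not from the original question):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
    // repartition just redistributes the records into 8 new partitions
    val moved = pairs.repartition(8)
    // reduceByKey shuffles so that every record with key "a"
    // lands in the same partition before being combined
    val reduced = pairs.reduceByKey(_ + _)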

Q- What happens to the initial partitions (the number of partitions declared) when a repartition happens?

A- Covered above. One more underlying thing: rdd.repartition(n) will not change the number of partitions of rdd itself. It is a transformation, which takes effect only when another RDD is created from it, like rdd1 = rdd.repartition(n).

Now it will create a new rdd1 that has n partitions. Alternatively, we can call the coalesce function, like rdd.coalesce(n). Note that coalesce is also a transformation, not an action: it returns a new RDD with the reduced number of partitions rather than changing rdd itself; it just avoids a full shuffle when decreasing partitions.
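A small sketch of that transformation behaviour (assuming a spark-shell session; the numbers are arbitrary):

    val rdd = sc.parallelize(1 to 100, 5)
    val rdd1 = rdd.repartition(10)
    println(rdd.partitions.length)   // still 5: the original RDD is unchanged
    println(rdd1.partitions.length)  // 10: only the new RDD has the new count
    val rdd2 = rdd.coalesce(2)       // coalesce is a transformation as well
    println(rdd2.partitions.length)  // 2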

Q- Can someone give an example of (explain) how data movement across the cluster happens? I have seen a couple of examples where random arrow movement of keys is shown (but I don't know how the movement is driven). For example, if we already have data in 10 partitions, does the repartitioning operation combine all the data first, and then send each key to a particular partition based on hash-code % number-of-partitions?

A- Partition and partitioning are two different concepts. A partition is a chunk into which the data is divided evenly, in the number of partitions set by the user; in partitioning, data is shuffled among those partitions according to an algorithm set by the user, like HashPartitioner or RangePartitioner.

For example:

    val rdd = sc.textFile("../path", 5)
    rdd.partitions.size          // or rdd.partitions.length
    // O/p: Int = 5 (number of partitions)
    rdd.partitioner.isDefined
    // O/p: Boolean = false
    rdd.partitioner
    // O/p: None (no partitioning scheme)

But with an explicit partitioner:

    val rdd = sc.textFile("../path", 5)
      .map(line => (line, 1))                                 // partitionBy needs a key-value RDD
      .partitionBy(new org.apache.spark.HashPartitioner(10))
      .cache()

    rdd.partitions.size
    // O/p: Int = 10

    rdd.partitioner.isDefined
    // O/p: Boolean = true

    rdd.partitioner
    // O/p: Some(org.apache.spark.HashPartitioner@...)
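And to see how an individual key is routed, which is essentially the hash-code % number-of-partitions question from above, here is a small sketch (the key "someKey" is just an arbitrary example):

    val p = new org.apache.spark.HashPartitioner(10)
    // Roughly key.hashCode % numPartitions, with negative values
    // wrapped so the result stays in [0, numPartitions)
    println(p.getPartition("someKey"))
    println(p.numPartitions)   // 10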

Hope this helps!