I would greatly appreciate it if someone could answer these few Spark shuffle-related questions in simple terms.
In Spark, when loading a dataset, we specify the number of partitions, which determines how many blocks the input data (RDD) is divided into. Based on the number of partitions, an equal number of tasks is launched (correct me if this assumption is wrong). With X cores on a worker node, X tasks run at one time.
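To make that assumption concrete, here is a minimal plain-Python sketch of how I picture it (this is not Spark code; `schedule_tasks`, `num_partitions`, and `total_cores` are hypothetical names I made up). It assumes one task per partition and at most one running task per core. I believe real Spark hands out tasks as cores free up rather than in strict waves, so treat the grouping as a simplification.

```python
def schedule_tasks(num_partitions, total_cores):
    """Group one-task-per-partition into batches of at most total_cores tasks."""
    # Assumption: exactly one task is created per partition.
    tasks = list(range(num_partitions))
    # Assumption: each core runs one task at a time, so at most
    # total_cores tasks can be in flight together.
    return [tasks[i:i + total_cores] for i in range(0, len(tasks), total_cores)]


waves = schedule_tasks(num_partitions=10, total_cores=4)
for n, wave in enumerate(waves, start=1):
    print(f"wave {n}: tasks for partitions {wave}")
```

With 10 partitions and 4 cores this gives three batches: two of 4 tasks and a final one of 2.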
Along similar lines, here are a few questions.
As I understand it, all byKey operations, along with coalesce, repartition, join, and cogroup, cause a data shuffle.
Is data shuffle another name for the repartitioning operation?
What happens to the initial partitions (the number of partitions declared) when repartitioning happens?
Can someone explain, with an example, how data movement across the cluster happens? I have seen a couple of examples where keys are shown moving along seemingly random arrows, but I don't know what drives the movement. For example, if we already have data in 10 partitions, does the repartitioning operation combine all the data first and then send each key to a particular partition based on hashCode % numberOfPartitions?
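Here is my current mental model as a small plain-Python simulation, in case it helps someone point out where I am wrong. It is only a sketch, not Spark's actual implementation: `simulate_shuffle` is a hypothetical helper, everything runs in one process with no network transfer, and I use integer keys so that Python's `hash` is deterministic. In this model each input partition is scanned on its own and every record is routed by `hash(key) % num_output_partitions`, without first combining all the data in one place.

```python
def simulate_shuffle(input_partitions, num_output_partitions):
    """Route (key, value) records to output partitions by hash(key) modulo."""
    output = [[] for _ in range(num_output_partitions)]
    # Each input partition is handled independently; nothing is merged
    # into a single global collection before routing.
    for partition in input_partitions:
        for key, value in partition:
            # Python's % is non-negative for a positive modulus.
            target = hash(key) % num_output_partitions
            output[target].append((key, value))
    return output


# Three input partitions holding (key, value) pairs with integer keys.
before = [
    [(1, "a"), (4, "b")],
    [(2, "c"), (1, "d")],
    [(3, "e"), (4, "f")],
]
after = simulate_shuffle(before, num_output_partitions=2)
for index, partition in enumerate(after):
    print(f"output partition {index}: {partition}")
```

In this sketch, all records with the same key land in the same output partition no matter which input partition they started in, and that is the part I would like confirmed or corrected for real Spark.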