I couldn't find anywhere how repartition is performed on a RDD internally? I understand that you can call repartition method on a RDD to increase the number of partition but how it is performed internally?
Assuming, initially there were 5 partition and they had -
- 1st partition - 100 elements
- 2nd partition - 200 elements
- 3rd partition - 500 elements
- 4th partition - 5000 elements
- 5th partition - 200 elements
Some of the partitions are skewed because they were loaded from HBase and data was not correctly salted in HBase which caused some of the region servers to have too many entries.
In this case, when we do repartition to 10, will it load all the partition first and then do the shuffling to create 10 partition? What if the full data cant be loaded into memory i.e. all partitions cant be loaded into memory at once? If Spark does not load all partition into memory then how does it know the count and how does it makes sure that data is correctly partitioned into 10 partitions.