2
votes

I couldn't find anywhere how repartition is performed on a RDD internally? I understand that you can call repartition method on a RDD to increase the number of partition but how it is performed internally?

Assuming, initially there were 5 partition and they had -

  • 1st partition - 100 elements
  • 2nd partition - 200 elements
  • 3rd partition - 500 elements
  • 4th partition - 5000 elements
  • 5th partition - 200 elements

Some of the partitions are skewed because they were loaded from HBase and data was not correctly salted in HBase which caused some of the region servers to have too many entries.

In this case, when we do repartition to 10, will it load all the partition first and then do the shuffling to create 10 partition? What if the full data cant be loaded into memory i.e. all partitions cant be loaded into memory at once? If Spark does not load all partition into memory then how does it know the count and how does it makes sure that data is correctly partitioned into 10 partitions.

1
@Krishna Kumar, the given answer isn't very clear and doesn't directly address the question. Were you able to find out the right answer?y2k-shubham
@y2k-shubham My understanding is that Spark will try to load everything in memory and will throw memory related exception if it cantKrishna Kumar

1 Answers

2
votes

From what I have understood, repartition will certainly trigger shuffle. From Job Logical Plan document following can be said about repartition

   - for each partition, every record is assigned a key which is an increasing number.
   - hash(key) leads to a uniform records distribution on all different partitions.

If Spark can't load all data into memory then memory issue will be thrown. So default processing of Spark is all done in memory i.e. there should always be sufficient memory for your data.
Persist option can be used to tell spark to spill your data in disk if there is not enough memory.
Jacek Laskowski also explains about repartitions.
Understanding your Apache Spark Application Through Visualization should be sufficient for you to test and know by yourself.