0 votes

I'm using Spark to process large files and I have 12 partitions. I have rdd1 and rdd2, I make a join between them and then select the result (rdd3). My problem is that I noticed the last partition is much bigger than the others: partitions 1 through 11 have about 45,000 records each, but partition 12 has 9,100,000. So I computed 9,100,000 / 45,000 ≈ 203 and repartitioned rdd3 into 214 (203 + 11) partitions, but the last partition is still too big. How can I balance the size of my partitions?
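To see why adding more partitions doesn't help here, it's worth simulating what Spark's default hash partitioner does. The sketch below is plain Python (not Spark) and uses made-up key names, but the mechanism is the same: every record with the same key goes to `hash(key) % numPartitions`, so a single hot key ends up in one partition no matter how many partitions you request.

```python
from collections import Counter

# Simulated key distribution matching the question:
# 11 "normal" keys with 45,000 records each, one hot key with 9,100,000.
records = {f"key{i}": 45_000 for i in range(11)}
records["hot"] = 9_100_000

def partition_sizes(num_partitions):
    """Simulate hash partitioning: partition = hash(key) % num_partitions."""
    sizes = Counter()
    for key, count in records.items():
        sizes[hash(key) % num_partitions] += count
    return sizes

# The hot key's 9,100,000 records always land together, so the largest
# partition never shrinks below 9,100,000 regardless of the partition count.
for n in (12, 214, 1000):
    print(n, max(partition_sizes(n).values()))
```

This is why `repartition(214)` didn't change anything: repartitioning redistributes keys across partitions, but it cannot split the records of a single key.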

Should I write my own custom partitioner?

Yes, repartition and partitionBy. – Zied Hermi
Can you detail what you tried and what feedback you got indicating they did not work? – Scott Hunter
Also please include the code, so we can see when in the process you are repartitioning. – Glennie Helles Sindholt
Should I write my own custom partitioner? – Zied Hermi
We need more details on your data. Partitioning partitions on keys, so if the majority of your keys are the same, one of the partitions will be large. – nairbv

1 Answer

1 vote

I have rdd1 and rdd2, I make a join between them

join is one of the most expensive operations in Spark. To join by key, values have to be shuffled, and if the keys are not uniformly distributed you get the behavior you describe: every record with a given key must land in the same partition, so a single hot key produces a single huge partition. A custom partitioner won't help in that case, because any partitioner that keeps equal keys together still has to put all records of the hot key into one partition.

I'd consider adjusting the logic, so it doesn't require a full join.
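For example, if one of the two RDDs is small enough to fit in memory, the shuffle can be avoided entirely with a map-side ("broadcast") join: ship the small side to every worker as a lookup table and join with a plain `map`/`filter`. The sketch below shows the idea in plain Python with made-up data; in Spark you would distribute `small` via a broadcast variable instead of shuffling both sides.

```python
# Hypothetical small side, collected into a dict (in Spark: a broadcast variable).
small = {"key1": "a", "hot": "b"}

# Hypothetical large side: (key, value) pairs, including the skewed "hot" key.
big = [("hot", 1), ("key1", 2), ("hot", 3)]

# Map-side join: no shuffle, so key skew no longer matters -- each record
# is joined where it already lives.
joined = [(k, v, small[k]) for k, v in big if k in small]
print(joined)
```

Since no data is moved by key, the hot key's records stay spread across whatever partitions they started in.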