9 votes

I am fetching data from HDFS and storing it in a Spark RDD. Spark creates partitions based on the number of HDFS blocks. This leads to a large number of empty partitions, which are also processed during piping (each partition launches the external process). To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but neither guarantees that all the empty partitions will be removed.

Is there any other way to go about this?
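For context, a minimal sketch of the setup being described (the HDFS path and the piped command are hypothetical placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipe-demo").setMaster("local[*]"))

// textFile creates one RDD partition per HDFS block (input split).
val lines = sc.textFile("hdfs:///user/hive/warehouse/some_table")  // hypothetical path

// pipe forks the external command once per partition, so empty partitions
// still pay the process start-up cost.
val piped = lines.pipe("/path/to/script.sh")  // hypothetical script
piped.count()
```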

"This leads to a large number of empty partitions which also get processed during piping" I do not understand this sentence. Why and when are this empty partitions created?Mikel Urkia
Suppose I am fetching data using Hive and HDFS has 500 file blocks for the given Hive table; in that case 500 partitions will be created in the RDD. Later, while doing a groupByKey, empty partitions are left (see the sketch below these comments). – user3898179
If you have some a priori knowledge about your data, you can repartition using either a RangePartitioner or a HashPartitioner. If not, you can partition based on random numbers. – zero323
I'd say empty partitions are automatically deleted and not processed by Spark, although I am not 100% sure. – Mikel Urkia
@MikelUrkia Empty partitions are not deleted (you can see them in the Spark UI). I have, however, never experienced empty partitions after doing a repartition... – Glennie Helles Sindholt
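To make the groupByKey scenario from these comments concrete, here is a minimal sketch (reusing sc from the snippet above; the keys and partition count are invented) of how a hash-partitioned shuffle leaves partitions empty:

```scala
// Only two distinct keys, hashed into 10 partitions: at least 8 of the
// output partitions receive no data at all.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey(numPartitions = 10)

// Count the partitions that ended up empty.
val empties = grouped.mapPartitions(it => Iterator(if (it.isEmpty) 1 else 0)).sum()
println(empties)  // 8.0 here: "a" and "b" hash to two different partitions
```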

1 Answer

6 votes

There isn't an easy way to simply delete the empty partitions from an RDD.

coalesce doesn't guarantee that the empty partitions will be deleted. If you have an RDD with 40 blank partitions and 10 partitions with data, there will still be empty partitions after rdd.coalesce(45).
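A minimal sketch of that behaviour (local mode; the numbers are invented to mirror the 40-blank/10-full example above):

```scala
// parallelize splits 1..1000 contiguously across 50 partitions (20 elements
// each); the filter then empties the last 40 partitions and keeps the
// first 10 full.
val rdd = sc.parallelize(1 to 1000, 50).filter(_ <= 200)

// Helper: count the partitions that hold no elements.
def countEmpty[T](r: org.apache.spark.rdd.RDD[T]): Long =
  r.mapPartitions(it => Iterator(if (it.isEmpty) 1L else 0L)).sum().toLong

println(countEmpty(rdd))               // 40
println(countEmpty(rdd.coalesce(45)))  // still well above 0: coalesce only
                                       // merges a few neighbouring partitions
```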

The repartition method performs a full shuffle and splits the data evenly over all the partitions, so there won't be any empty partitions. If you have an RDD with 50 blank partitions and 10 partitions with data and run rdd.repartition(20), the data will be evenly split across the 20 partitions.
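Continuing the sketch above, the shuffle spreads the surviving elements round-robin across the requested number of partitions:

```scala
// repartition shuffles every element, so the 200 remaining values are
// distributed roughly evenly over the 20 requested partitions.
val even = rdd.repartition(20)

println(countEmpty(even))  // 0: each partition receives about 10 elements
```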