I am fetching data from HDFS and storing it in a Spark RDD. Spark creates partitions based on the number of HDFS blocks, which leads to a large number of empty partitions that still get processed during piping. To remove this overhead, I want to filter out all the empty partitions from the RDD. I am aware of coalesce and repartition, but neither guarantees that all the empty partitions will be removed.
Is there any other way to go about this?
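For concreteness, a minimal sketch of the situation (the HDFS paths and pipe script below are placeholders): counting records per partition and then coalescing down to the number of non-empty partitions shrinks the partition count, but, as said above, it does not guarantee every empty partition is merged away.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipe-job"))

// Roughly one partition per HDFS block; many of them may be empty.
val rdd = sc.textFile("hdfs:///data/input")

// Count the records in each partition to see how many are actually non-empty.
val nonEmpty = rdd
  .mapPartitions(it => Iterator(it.size))
  .collect()
  .count(_ > 0)

// Coalescing to the non-empty count shrinks the partition count without a
// shuffle, but parent partitions are grouped blindly, so some resulting
// partitions can still end up empty.
val fewer = rdd.coalesce(math.max(nonEmpty, 1))

fewer.pipe("./process.sh").saveAsTextFile("hdfs:///data/output")
```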
RangePartitioner or HashPartitioner. If not, you can partition based on random numbers. – zero323

repartition ... – Glennie Helles Sindholt
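A sketch of the direction the comments point at, assuming the record count is at least the target partition count (the target of 32 and the script name are illustrative): keying each record by a random number and hash-partitioning on that key spreads records roughly evenly, which makes empty partitions very unlikely, though still not strictly guaranteed.

```scala
import scala.util.Random
import org.apache.spark.HashPartitioner

// Key each record by a random int and hash-partition on that key; the
// records end up spread roughly uniformly over the target partitions.
val target = 32  // illustrative target partition count
val spread = rdd
  .map(record => (Random.nextInt(), record))
  .partitionBy(new HashPartitioner(target))
  .values  // drop the temporary random keys

spread.pipe("./process.sh")
```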