
I have a question about Spark: HDFS blocks vs. cluster cores vs. RDD partitions.

Assume I am processing a file in HDFS (say the block size is 64 MB and the file is 6400 MB), so ideally it has 100 splits.

My cluster has 200 cores in total, and I submitted the job with 25 executors of 4 cores each (meaning 100 tasks can run in parallel).

In a nutshell, the RDD has 100 partitions by default and 100 cores will be used.

Is this a good approach, or should I repartition the data into 200 partitions and use all the cores in the cluster?
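For illustration, a minimal sketch of reading the file and checking the default partition count (Scala; the file path and app name are placeholders, not from the original post):

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-check"))

        // With a 64 MB block size, a 6400 MB file has ~100 HDFS blocks,
        // so textFile() produces ~100 partitions by default.
        val rdd = sc.textFile("hdfs:///data/input.txt")
        println(s"Default partitions: ${rdd.getNumPartitions}") // expect ~100

        sc.stop()
      }
    }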

Comment: if you do a repartition it will add extra overhead and take longer; better not to use repartition. – maogautam

1 Answer


Since you have 200 cores in total, using all of them can improve performance, depending on the kind of workload you are running.

Configure your Spark application to use 50 executors (i.e. all 200 cores are available to Spark). Also change the Spark split size from 64 MB to 32 MB. This ensures the 6400 MB file is divided into 200 RDD partitions, so the entire cluster can be used to read it.
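As a rough sketch of that sizing (Scala; the input path is a placeholder, and the executor settings could equally be passed on the command line as --num-executors 50 --executor-cores 4): requesting at least 200 partitions at read time makes the input format target ~32 MB splits, with no shuffle involved.

    import org.apache.spark.{SparkConf, SparkContext}

    object FullClusterRead {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("full-cluster-read")
          .set("spark.executor.instances", "50") // 50 executors x 4 cores = 200 cores
          .set("spark.executor.cores", "4")
        val sc = new SparkContext(conf)

        // Asking for at least 200 partitions targets ~32 MB splits
        // (6400 MB / 200), so the file is read straight into ~200 partitions.
        val rdd = sc.textFile("hdfs:///data/input.txt", minPartitions = 200)
        println(rdd.getNumPartitions) // expect ~200

        sc.stop()
      }
    }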

Don't use repartition here: it is slower because it involves a shuffle.
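For contrast, this is the pattern the answer advises against: repartitioning after the read shuffles the whole 6400 MB across the network just to go from 100 partitions to 200, whereas the minPartitions approach above avoids that cost (same placeholder path as before).

    // Works, but forces a full shuffle of the data:
    val reshuffled = sc.textFile("hdfs:///data/input.txt").repartition(200)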