I have a question about Spark: HDFS blocks vs. cluster cores vs. RDD partitions.
Assume I am processing a file in HDFS with a block size of 64 MB and a file size of 6400 MB, so ideally it has 100 splits.
My cluster has 200 cores in total, and I submitted the job with 25 executors of 4 cores each, meaning 100 tasks can run in parallel.
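Roughly, the setup looks like this (a sketch; the app name is a placeholder, and in practice the same settings can be passed via spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the job setup described above; the app name is a placeholder.
val spark = SparkSession.builder()
  .appName("hdfs-block-partition-test")        // hypothetical app name
  .config("spark.executor.instances", "25")    // 25 executors
  .config("spark.executor.cores", "4")         // 4 cores each -> 100 task slots
  .getOrCreate()

val sc = spark.sparkContext
```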
In a nutshell, the RDD has 100 partitions by default and 100 cores will be used.
Is this a good approach, or should I repartition the data to 200 partitions and use all the cores in the cluster?
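In code, what I am comparing is roughly this (the HDFS path is a placeholder):

```scala
// Read the 6400 MB file; with 64 MB HDFS blocks this should yield ~100 input partitions.
val rdd = sc.textFile("hdfs:///data/input/bigfile.txt")   // placeholder path
println(s"Default partitions: ${rdd.getNumPartitions}")   // expect ~100

// Option A: leave it as-is, so 100 tasks run on 100 of the 200 cores.
// Option B: repartition to 200 so every core gets a task, at the cost of a shuffle.
val repartitioned = rdd.repartition(200)
println(s"After repartition: ${repartitioned.getNumPartitions}")  // 200
```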