I have a 100 MB text file. I am reading that file, converting it into a DataFrame, and caching it. The cached DataFrame has two partitions on two different executors.
The reason for caching is that the cached DataFrame is used by 100 actions in my Spark application. These 100 actions read different files and also join against the cached DataFrame, roughly as in the sketch below.
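Here is a minimal sketch of what I am doing (the paths, column names, and parsing logic are placeholders, not my actual code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CachedJoinExample").getOrCreate()
import spark.implicits._

// Read the ~100 MB text file and cache the resulting DataFrame.
val cachedDF = spark.read.textFile("/data/input/lookup.txt")
  .map { line => val parts = line.split(","); (parts(0), parts(1)) }
  .toDF("key", "value")
  .cache()

// Check how many partitions the cached DataFrame has (two by default).
println(s"Partitions: ${cachedDF.rdd.getNumPartitions}")

// Each of the ~100 actions reads a different file and joins it
// against the cached DataFrame.
val otherDF = spark.read.parquet("/data/input/other_file_1.parquet")
val joined = otherDF.join(cachedDF, Seq("key"))
joined.count() // action: reuses the cached DataFrame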
My cluster has 100 nodes, each with 40 GB of memory and 24 cores.
My configuration in the spark-submit command is below:
MASTER_URL=yarn-cluster
NUM_EXECUTORS=10
EXECUTOR_MEMORY=4G
EXECUTOR_CORES=6
DRIVER_MEMORY=3G
My questions are:
Do I need to read the 100 MB text file as a single partition, given that at the moment it reads as two partitions by default?
If I do that, does it reduce the shuffle?
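For clarity, by reading it as a single partition I mean something like the following (again with a placeholder path):

// Collapse to one partition before caching.
val singlePartitionDF = spark.read
  .textFile("/data/input/lookup.txt")
  .coalesce(1)
  .cache()

println(singlePartitionDF.rdd.getNumPartitions) // prints 1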