
I have a 100 MB text file. I am reading that file, converting it into a DataFrame, and caching it. The cached DataFrame ends up with two partitions on two different executors.

The reason for caching is that the cached DataFrame is used by 100 actions happening in my Spark application.

These 100 actions read different files and also use the cached DataFrame to perform some joins.
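For illustration, here is a minimal Scala sketch of this setup. The file paths, the comma-separated layout of the text file, and the join key column id are assumptions for the example, not details of my actual job.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CachedJoinSketch").getOrCreate()
import spark.implicits._

// Read the ~100 MB text file, parse each line into (id, value), and cache the
// result so the ~100 downstream actions reuse the in-memory copy.
val cachedDf = spark.read
  .textFile("/path/to/100mb_file.txt")                      // assumed path
  .map { line => val p = line.split(","); (p(0), p(1)) }    // assumed line layout
  .toDF("id", "value")
  .cache()

// One of the ~100 actions: read another file and join it with the cached frame.
val otherDf = spark.read.parquet("/path/to/other_data")     // assumed path
otherDf.join(cachedDf, Seq("id")).count()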

My cluster has 100 nodes, each with 40 GB of memory and 24 cores.

My configuration in the spark-submit command is below:

MASTER_URL=yarn-cluster
NUM_EXECUTORS=10
EXECUTOR_MEMORY=4G
EXECUTOR_CORES=6
DRIVER_MEMORY=3G

My questions are:

Do I need to read the 100 MB text file as a single partition, given that at the moment it is read as two partitions by default?

If I do that, does it reduce the shuffle?
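To make the single-partition idea concrete, this is a hedged sketch (continuing the assumed cachedDf example above) of the two usual ways to end up with one partition:

// Option 1: collapse the read into a single partition before caching.
// coalesce(1) does not shuffle; the whole frame then sits on one executor core.
val onePartitionDf = spark.read.textFile("/path/to/100mb_file.txt").coalesce(1).cache()

// Option 2: raise the maximum split size so the reader produces fewer, larger
// partitions for the file in the first place (value is in bytes).
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)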

100 MB is not a lot to be worried about, to be honest. You can read it as 2 partitions to increase parallelism, but once you cache it, it's not going to make a lot of difference. – philantrovert

1 Answer


I think you are already hinting at the fact that the 100 MB table can be broadcast, so that when you use it in joins, Spark will perform a broadcast join, which is much faster.

In this case, I don't think you need to worry too much about the number of partitions. In fact, broadcasting may even be faster with more partitions, since a BitTorrent-like protocol is used by default for broadcasting in recent Spark releases.
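As a concrete illustration, here is a small sketch that forces the broadcast explicitly with the broadcast() hint; the paths and the join column id are assumptions, not taken from your job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastJoinSketch").getOrCreate()

val smallDf = spark.read.parquet("/path/to/small_100mb_table")  // assumed path
val largeDf = spark.read.parquet("/path/to/large_table")        // assumed path

// broadcast() ships the small frame to every executor, so the join runs
// locally against each partition of the large side and no shuffle is needed.
largeDf.join(broadcast(smallDf), Seq("id")).count()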

For your requirement, you have to bump up the spark.sql.autoBroadcastJoinThreshold value to more than 100 MB. Refer to this thread.
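For example, the threshold can be raised at runtime as below (the value is in bytes; the default is 10 MB, so pick something comfortably above your 100 MB table), or passed as a --conf option to spark-submit:

// Raise the automatic broadcast threshold to ~200 MB so the 100 MB table
// qualifies for a broadcast join (value is in bytes).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200L * 1024 * 1024)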