
I'm working on a Spark project and I'm using a Hadoop cluster of 3 nodes with the following configuration:

  • 8 cores and 16 GB of RAM (NameNode, Application Master, NodeManager, and Spark master and worker)
  • 4 cores and 8 GB of RAM (DataNode, NodeManager and worker)
  • 4 cores and 4 GB of RAM (DataNode, NodeManager and worker)

I'm currently launching PySpark with the following configuration:

    pyspark --master yarn-client --driver-memory 3g --executor-memory 1g --num-executors 3 --executor-cores 1

What is the best number of executors, and the best amount of memory and cores per executor, to use the full capacity of my cluster?


2 Answers


This essentially boils down to how many resources you need to process the data. If the whole cluster is available for your job, you can use all of it.

pyspark --master yarn-client --driver-memory 3g --executor-memory 1g --num-executors 3 --executor-cores 1

Here you aren't using the complete cluster. You are requesting a 3 GB driver and 3 executors with 1 GB and 1 core each, so the executors get only 3 GB of memory and 3 cores in total, whereas the two DataNode machines alone have 12 GB of memory and 8 cores. One alternative configuration you could try:

pyspark --master yarn-client --driver-memory 8g --executor-memory 3g --num-executors 4 --executor-cores 3

This uses far more of the cluster: 4 executors with 3 GB and 3 cores each, plus an 8 GB driver.
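To compare the two commands above against the hardware listed in the question, the arithmetic can be sketched in a few lines of Python. This is only a rough tally: it uses the raw per-node figures from the question and ignores memory reserved for the OS, the Hadoop daemons and YARN's per-container overhead, so the usable totals will be lower in practice. The helper name `requested` is made up for this illustration.

```python
# (cores, memory in GB) per node, as listed in the question
nodes = [(8, 16), (4, 8), (4, 4)]
cluster_cores = sum(cores for cores, _ in nodes)
cluster_memory_gb = sum(mem for _, mem in nodes)


def requested(num_executors, executor_cores, executor_memory_gb, driver_memory_gb):
    """Return (total executor cores, total memory in GB) asked for by a command."""
    cores = num_executors * executor_cores
    memory_gb = num_executors * executor_memory_gb + driver_memory_gb
    return cores, memory_gb


original = requested(3, 1, 1, 3)  # --num-executors 3 --executor-cores 1 --executor-memory 1g --driver-memory 3g
proposed = requested(4, 3, 3, 8)  # --num-executors 4 --executor-cores 3 --executor-memory 3g --driver-memory 8g

for name, (cores, mem) in [("original", original), ("proposed", proposed)]:
    print(f"{name}: {cores}/{cluster_cores} cores, {mem}/{cluster_memory_gb} GB")
```

On the raw numbers, the original command asks for 3 of 16 cores and 6 of 28 GB, while the alternative asks for 12 of 16 cores and 20 of 28 GB.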

However, the right executor-memory setting mostly depends on the job's requirements, so you will need to tune it over several runs. You can check this document for tuning.


This blog post by Sandy Ryza nicely explains allocation of resources with various overheads, and here is a handy Excel cheat-sheet.
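As a rough illustration of the kind of overhead accounting that post describes, here is a minimal Python sketch. It assumes YARN grants each executor its heap plus an off-heap overhead of max(384 MB, 10% of the heap); the exact fraction depends on the Spark version (older releases used 7%), so it is a parameter here. It also assumes all of a node's memory is available to YARN, which is optimistic. The function names are hypothetical, not part of any Spark API.

```python
def container_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Memory YARN must grant for one executor: heap plus off-heap overhead (assumed formula)."""
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead_mb


def executors_that_fit(node_memory_mb, node_cores, executor_memory_mb, executor_cores):
    """How many executors of the given size fit on one node, limited by memory and by cores."""
    by_memory = node_memory_mb // container_mb(executor_memory_mb)
    by_cores = node_cores // executor_cores
    return min(by_memory, by_cores)


# A 3 GB executor needs a 3456 MB container under these assumptions
print(container_mb(3072))

# Executors with 3 GB and 3 cores on each node from the question
for memory_gb, cores in [(16, 8), (8, 4), (4, 4)]:
    print(memory_gb, "GB node:", executors_that_fit(memory_gb * 1024, cores, 3072, 3))
```

Under these assumptions the three nodes fit 2, 1 and 1 such executors respectively, before setting anything aside for the driver or the daemons running on each machine.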

However, if you're new to Spark and/or frequently change cluster size or type, might I suggest enabling dynamic allocation?
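A minimal sketch of what that could look like, written as a small Python helper that assembles the `pyspark` command line. The property names are standard Spark settings; the min/max executor values are placeholders chosen for this cluster, and the helper `pyspark_command` is hypothetical. Note that on YARN, dynamic allocation also needs the external shuffle service to be set up on each NodeManager.

```python
import shlex

# Standard Spark properties; the executor bounds are illustrative assumptions
dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "4",
}


def pyspark_command(conf, master="yarn-client"):
    """Build a pyspark command line that passes each setting as a --conf flag."""
    args = ["pyspark", "--master", master]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return " ".join(shlex.quote(arg) for arg in args)


command = pyspark_command(dynamic_allocation)
print(command)
```

With dynamic allocation enabled, `--num-executors` no longer has to be guessed up front; Spark requests and releases executors between the configured bounds as the workload changes.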