1 vote

I'm working on a Spark project, and I'm using a Hadoop cluster of 3 nodes with the following configuration:

  • 8 cores and 16 GB of RAM (NameNode, Application Master, NodeManager, and Spark master and worker).
  • 4 cores and 8 GB of RAM (DataNode, NodeManager, and Spark worker).
  • 4 cores and 4 GB of RAM (DataNode, NodeManager, and Spark worker).

So I'm using the following configuration:

    pyspark --master yarn-client --driver-memory 3g --executor-memory 1g --num-executors 3 --executor-cores 1

What's the best number of executors, amount of memory, and number of cores to make use of all my cluster's resources?


2 Answers

1 vote

This essentially boils down to how much data you need to process. If you have the whole cluster available for the job, you can use it completely.

pyspark --master yarn-client --driver-memory 3g --executor-memory 1g --num-executors 3 --executor-cores 1

Here you aren't using the complete cluster. You are using a 3 GB driver and three 1 GB executors, i.e. only 3 GB of executor memory in total, whereas the cluster has 12 GB of memory and 8 cores available. One alternative configuration you could try:

pyspark --master yarn-client --driver-memory 8g --executor-memory 3g --num-executors 4 --executor-cores 3

This uses the complete cluster.
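
To make the arithmetic explicit (a rough back-of-the-envelope check that ignores YARN's per-container memory overhead, which takes a slice out of each executor):

    # Original:    3 executors x 1 GB =  3 GB of executor memory,  3 x 1 core  =  3 cores
    # Alternative: 4 executors x 3 GB = 12 GB of executor memory,  4 x 3 cores = 12 cores

Memory-wise this fills the 12 GB; whether all 12 executor cores can be scheduled at once depends on how many vcores YARN is configured to offer per node.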

However, the executor-memory setting mostly depends on the job's requirements; you'll need to tune it over several runs. You can check this document for tuning.
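
One detail worth keeping in mind while tuning (hedged, since the exact defaults vary by Spark version): on YARN, each executor container needs some off-heap overhead on top of --executor-memory, controlled in Spark 1.x by spark.yarn.executor.memoryOverhead (by default a few hundred MB, or roughly 7-10% of the executor memory). If executor memory plus overhead exceeds what a NodeManager can offer, the container won't be scheduled, so it can help to set the overhead explicitly while experimenting, for example:

    pyspark --master yarn-client --driver-memory 3g \
        --executor-memory 3g --num-executors 4 --executor-cores 3 \
        --conf spark.yarn.executor.memoryOverhead=512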

0 votes

This blog post by Sandy Ryza nicely explains how to allocate resources while accounting for the various overheads, and here is a handy Excel cheat sheet.

However, if you're new to Spark and/or frequently change cluster size or type, might I suggest enabling dynamic allocation?
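
A minimal sketch of what that could look like, assuming the YARN NodeManagers are already running Spark's external shuffle service (which dynamic allocation requires) and with purely illustrative min/max bounds:

    pyspark --master yarn-client --driver-memory 3g \
        --executor-memory 2g --executor-cores 2 \
        --conf spark.dynamicAllocation.enabled=true \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.dynamicAllocation.minExecutors=1 \
        --conf spark.dynamicAllocation.maxExecutors=4

With dynamic allocation you drop --num-executors entirely; Spark requests and releases executors based on the backlog of pending tasks, which is convenient when the cluster size keeps changing.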