How do you decide the --executor-memory and --num-executors values in a spark-submit job? And what is the concept of --executor-cores?

Also, what is the difference between cluster and client deploy mode, and how do you choose between them?


1 Answer


The first part of your question, about --executor-memory, --num-executors and --executor-cores, usually depends on the kind of work your Spark application is going to perform.

  • Executor memory (--executor-memory) is the amount of memory you allocate to each executor JVM (its heap size). The value depends on your requirements. For example, if you're just going to parse a large text file you'll need far less memory than for, say, image processing.
  • The number of executors (--num-executors) is the number of executor JVMs you want to launch on your cluster. Again, it depends on many factors such as your cluster size, the type of machines in the cluster, and so on.
  • Each executor splits its work into tasks, and each task runs on one executor core (--executor-cores). This gives you parallelism within a single executor, but make sure you don't allocate all the cores of a machine to its executor, because some are needed for the normal functioning of the machine itself. A sample spark-submit invocation using these three flags is sketched after this list.
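
To make these flags concrete, here is a minimal spark-submit sketch. The numbers are illustrative only, and the application class and JAR name are placeholders, not real artifacts:

    # Illustrative values only; tune them to your workload and cluster size.
    # --num-executors  : number of executor JVMs launched across the cluster
    # --executor-cores : number of tasks each executor can run in parallel
    # --executor-memory: heap size of each executor JVM
    spark-submit \
      --master yarn \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 8G \
      --class com.example.MyApp \
      my-app.jar

As a rough rule of thumb, leave at least one core and a slice of memory on each machine for the OS and cluster daemons instead of handing everything to the executors.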

On to the second part of your question: Spark has the two --deploy-mode values you already named, client and cluster.

  • client mode is when you connect an external machine to a cluster and run a Spark job from that machine, for example when you connect your laptop to a cluster and run spark-shell from it. The driver JVM runs on your laptop, and the session is killed as soon as you disconnect. The same applies to a spark-submit job: with --deploy-mode client your laptop runs the driver, so the job dies if the laptop disconnects or shuts down while the application is running.
  • cluster mode: when you specify --deploy-mode cluster, then even if you submit the job from your laptop or any other machine, the driver runs inside the cluster and the job (JAR) is taken care of by the ResourceManager and ApplicationMaster, just like any other YARN application, so it keeps running after you disconnect. You won't see the driver output on your screen, but most non-trivial Spark jobs write their results to a filesystem such as HDFS anyway, and you can retrieve the driver logs through YARN. Both invocations are sketched below.
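
As a rough sketch, the same application submitted in the two deploy modes looks like this; again, the class and JAR names are placeholders:

    # client mode: the driver JVM runs on the submitting machine (e.g. your laptop),
    # so the application fails if that machine disconnects
    spark-submit \
      --master yarn \
      --deploy-mode client \
      --class com.example.MyApp \
      my-app.jar

    # cluster mode: the driver runs inside the cluster (in the ApplicationMaster
    # container on YARN), so the job keeps running after you disconnect
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      my-app.jar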