How do you decide the `--executor-memory` and `--num-executors` for a spark-submit job? And what is the concept of `--executor-cores`?
Also, what is the clear difference between cluster and client deploy mode, and how do you choose the deploy mode?
The first part of your question, where you ask about `--executor-memory`, `--num-executors` and `--executor-cores`, usually depends on the variety of tasks your Spark application is going to perform. Spark splits the work of a job into tasks, and these tasks are performed on executor cores (or processors). This helps you achieve parallelism within a single executor, but make sure you don't allocate all the cores of a machine to its executor, because some are needed for the normal functioning of the machine.
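For example, here is a minimal `spark-submit` sketch that puts the three flags together (the sizes, `com.example.MyApp` and `my-app.jar` are just placeholders, not values from your job):

```bash
# Hypothetical sizing: 4 executors, each with 4 GB of memory and 2 cores,
# leaving some cores on each machine free for the OS and YARN/Hadoop daemons.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 4G \
  --executor-cores 2 \
  --class com.example.MyApp \
  my-app.jar
```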
On to the second part of your question: Spark has two `--deploy-mode` options, which you have already named, i.e. `cluster` and `client`.
`client` mode is when you connect an external machine to a cluster and run a Spark job from that external machine, for example when you connect your laptop to a cluster and run `spark-shell` from it. The driver JVM is started on your laptop, and the session is killed as soon as you disconnect the laptop. The same goes for a `spark-submit` job: if you run it with `--deploy-mode client`, your laptop hosts the driver, so the job dies as soon as the laptop is disconnected.
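As a rough sketch, a client-mode submission from your laptop could look like this (the class and JAR names are again placeholders):

```bash
# The driver JVM starts on the machine running this command, so closing the
# laptop or killing the shell terminates the driver and the job with it.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar
```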
`cluster` mode: when you specify `--deploy-mode cluster` for your job, then even if you submit it from your laptop or any other machine, the job (JAR) is taken care of by the ResourceManager and ApplicationMaster, just like any other application in YARN. You won't be able to see the output on your screen, but most complex Spark jobs write their results to a filesystem anyway, so that is taken care of that way.
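And a comparable cluster-mode sketch; the HDFS path is only an illustration of the "write to a filesystem" point, assuming the application takes an output path as its argument:

```bash
# The driver runs in an ApplicationMaster container on the cluster, so the
# submitting machine can disconnect once YARN has accepted the application.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar hdfs:///user/me/output
```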