10
votes

I am running Spark with YARN.

From the link: http://spark.apache.org/docs/latest/running-on-yarn.html

I found an explanation of the different YARN modes, i.e. the --master options with which Spark can run:

"There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN"

From this, I can only understand that the difference is where the driver runs, but I cannot tell which mode runs faster. Moreover:

  • When running spark-submit, the master option can be either yarn-client or yarn-cluster
  • Correspondingly, spark-shell's master option can be yarn-client, but it does not support cluster mode

So I do not know how to make the choice: when to use spark-shell vs. spark-submit, and especially when to use client mode vs. cluster mode.


4 Answers

14
votes

spark-shell should be used for interactive queries; it needs to run in yarn-client mode so that the machine you're running it on acts as the driver.
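For example, an interactive session on YARN might be started like this (a sketch; on Spark 2.x and later the old yarn-client master string is replaced by --master yarn with the client deploy mode):

```shell
# Start an interactive shell; the driver (and the REPL) run on this machine.
# Older Spark versions used the equivalent: spark-shell --master yarn-client
spark-shell --master yarn --deploy-mode client
```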

With spark-submit, you submit jobs to the cluster, and the job then runs in the cluster. Normally you would run in cluster mode so that YARN can assign the driver to a suitable node on the cluster with available resources.
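A typical cluster-mode submission might look like the following sketch; the application class and jar name here are hypothetical placeholders:

```shell
# Submit a packaged application; YARN picks a cluster node to host the driver
# inside the ApplicationMaster. (com.example.MyApp and my-app.jar are made up.)
# Older Spark versions used the equivalent: --master yarn-cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```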

Some commands (like .collect()) send all the data to the driver node, which can cause a significant performance difference depending on whether your driver node is inside the cluster or on a machine outside it (e.g. a user's laptop).

7
votes

For learning purposes, client mode is good enough. In a production environment you should ALWAYS use cluster mode.

I'll explain with the help of an example. Imagine a scenario where you want to launch multiple applications. Let's say you have a 5-node cluster with nodes A, B, C, D, and E.

The workload will be distributed across all 5 worker nodes, and one node is additionally used to submit jobs (say 'A' is used for this). Now every time you launch an application in client mode, the driver process always runs on 'A'.

It might work well for a few jobs, but as the jobs keep increasing, 'A' will run short of resources such as CPU and memory.

Imagine the impact on a very large cluster which runs multiple such jobs.

But if you choose cluster mode, the driver will not run on 'A' every time; the drivers will instead be distributed across all 5 nodes. The resources in this case are utilized more evenly.

Hope this helps you to decide what mode to choose.

1
votes

Client mode - Use for interactive queries where you want the output directly (on a local machine or edge node). This runs the driver on the local machine / edge node from which you launched the application.

Cluster mode - This mode launches the driver inside the cluster, irrespective of the machine used to submit the application. YARN creates an ApplicationMaster in which the driver runs, so it can be restarted by YARN on failure and hence becomes more fault tolerant.
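One practical consequence of the driver living inside the cluster: its stdout/stderr do not appear in your terminal. You would typically retrieve them afterwards with the yarn CLI, where the application ID placeholder must be filled in from spark-submit's output or the YARN UI:

```shell
# Fetch the aggregated container logs, including the driver's output,
# after the application has finished. <applicationId> is a placeholder.
yarn logs -applicationId <applicationId>
```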

0
votes

Here is a link which is quite clear and simple.

In cluster mode, the Spark driver runs in the ApplicationMaster on a cluster host. A single process in a YARN container is responsible for both driving the application and requesting resources from YARN. The client that launches the application does not need to run for the lifetime of the application.

In client mode, the Spark driver runs on the host where the job is submitted. The ApplicationMaster is responsible only for requesting executor containers from YARN. After the containers start, the client communicates with the containers to schedule work.