10
votes

I am running Spark with YARN.

From the link: http://spark.apache.org/docs/latest/running-on-yarn.html

I found an explanation of the different YARN modes, i.e. the --master options with which Spark can run:

"There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN"

From this, I can only understand that the difference is where the driver runs, but I cannot tell which mode runs faster. Moreover:

  • When running spark-submit, the master option can be either yarn-client or yarn-cluster
  • Correspondingly, spark-shell's master option can be yarn-client, but it does not support cluster mode

So I do not know how to make the choice: when to use spark-shell vs. spark-submit, and especially when to use client mode vs. cluster mode.


4 Answers

14
votes

spark-shell should be used for interactive queries; it needs to run in yarn-client mode so that the machine you're running it on acts as the driver.
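For example, an interactive session on YARN might be started like this (a sketch; on Spark 2.x and later the old yarn-client master string is replaced by --master yarn with the client deploy mode):

```shell
# Start an interactive shell; the driver (and the REPL) run on this machine.
# Older Spark versions used the equivalent: spark-shell --master yarn-client
spark-shell --master yarn --deploy-mode client
```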

With spark-submit, you submit jobs to the cluster, and the job then runs in the cluster. Normally you would run in cluster mode so that YARN can assign the driver to a suitable node on the cluster with available resources.
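A typical cluster-mode submission might look like the following sketch; the application class and jar name here are hypothetical placeholders:

```shell
# Submit a packaged application; YARN picks a cluster node to host the driver
# inside the ApplicationMaster. (com.example.MyApp and my-app.jar are made up.)
# Older Spark versions used the equivalent: --master yarn-cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```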

Some commands (like .collect()) send all the data to the driver node, which can cause a significant performance difference depending on whether your driver node is inside the cluster or on a machine outside it (e.g. a user's laptop).

7
votes

For learning purposes, client mode is good enough. In a production environment you should ALWAYS use cluster mode.

I'll explain with the help of an example. Imagine a scenario where you want to launch multiple applications. Let's say you have a 5-node cluster with nodes A, B, C, D, and E.

The workload will be distributed across all 5 worker nodes, and one node is additionally used to submit jobs (say 'A' is used for this). Now every time you launch an application in client mode, the driver process always runs on 'A'.

It might work well for a few jobs, but as the jobs keep increasing, 'A' will run short of resources such as CPU and memory.

Imagine the impact on a very large cluster which runs multiple such jobs.

But if you choose cluster mode, the driver will not run on 'A' every time; the drivers will instead be distributed across all 5 nodes. The resources in this case are utilized more evenly.

Hope this helps you to decide what mode to choose.

1
votes

Client mode - Use for interactive queries where you want the output directly (on a local machine or edge node). This runs the driver on the local machine / edge node from which you launched the application.

Cluster mode - This mode launches the driver inside the cluster, irrespective of the machine used to submit the application. YARN creates an ApplicationMaster in which the driver runs, so it can be restarted by YARN on failure and hence becomes more fault tolerant.
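One practical consequence of the driver living inside the cluster: its stdout/stderr do not appear in your terminal. You would typically retrieve them afterwards with the yarn CLI, where the application ID placeholder must be filled in from spark-submit's output or the YARN UI:

```shell
# Fetch the aggregated container logs, including the driver's output,
# after the application has finished. <applicationId> is a placeholder.
yarn logs -applicationId <applicationId>
```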

0
votes

Here is a link which is quite clear and simple.

In cluster mode, the Spark driver runs in the ApplicationMaster on a cluster host. A single process in a YARN container is responsible for both driving the application and requesting resources from YARN. The client that launches the application does not need to run for the lifetime of the application.

In client mode, the Spark driver runs on the host where the job is submitted. The ApplicationMaster is responsible only for requesting executor containers from YARN. After the containers start, the client communicates with the containers to schedule work.