
I recently set up a multi-node Hadoop HA (NameNode and ResourceManager) cluster with 3 nodes. The installation is complete and all daemons run as expected.

Daemons on NN1:

2945 JournalNode
3137 DFSZKFailoverController
6385 Jps
3338 NodeManager
22730 QuorumPeerMain
2747 DataNode
3228 ResourceManager
2636 NameNode

Daemons on NN2:

19620 Jps
3894 QuorumPeerMain
16966 ResourceManager
16808 NodeManager
16475 DataNode
16572 JournalNode
17101 NameNode
16702 DFSZKFailoverController

Daemons on DN1:

12228 QuorumPeerMain
29060 NodeManager
28858 DataNode
29644 Jps
28956 JournalNode

I am interested in running Spark jobs on my YARN setup. I have installed Scala and Spark on NN1, and I can successfully start Spark by issuing the following command:

$ spark-shell

I have no knowledge of Spark, and I would like to know how I can run Spark on YARN. I have read that we can run it as either yarn-client or yarn-cluster.

Should I install Spark and Scala on all nodes in the cluster (NN2 and DN1) to run Spark on YARN in client or cluster mode? If not, how can I submit Spark jobs from the NN1 (primary NameNode) host?

I have copied the Spark assembly JAR to HDFS, as suggested in a blog I read:

-rw-r--r--   3 hduser supergroup  187548272 2016-04-04 15:56 /user/spark/share/lib/spark-assembly.jar

I also created a SPARK_JAR variable in my bashrc file. I tried to submit the Spark job as yarn-client, but I end up with the error below. I have no idea whether I am doing it all correctly or whether other settings need to be done first.

[hduser@ptfhadoop01v spark-1.6.0]$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 2 --queue thequeue lib/spark-examples*.jar 10
16/04/04 17:27:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/04 17:27:51 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - spark.executor.instances to configure the number of instances in the spark config.

16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment.  This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment.   This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/04/04 17:27:58 WARN MetricsSystem: Stopping a MetricsSystem that is not running
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[hduser@ptfhadoop01v spark-1.6.0]$

Please help me resolve this and explain how to run Spark on YARN in client or cluster mode.

Can someone specify the basic configuration that needs to be set in spark-env.sh and spark-defaults.conf to start spark-shell as yarn-client? I cannot find any sample files to refer to.

Ashesh Nair

3 Answers


I have no knowledge of Spark, and I would like to know how I can run Spark on YARN. I have read that we can run it as either yarn-client or yarn-cluster.

It's highly recommended that you read the official documentation of Spark on YARN at http://spark.apache.org/docs/latest/running-on-yarn.html.
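
As a rough illustration of the two modes mentioned in the question, here is a small sketch that only prints the spark-submit command line for each deploy mode. The class and JAR glob are the SparkPi example from the question; build_submit_cmd is a made-up helper name, not part of Spark. As far as I know, the yarn-client and yarn-cluster master values in Spark 1.x are shorthand for the same --master yarn --deploy-mode pairs.

```shell
#!/usr/bin/env bash
# Sketch: print the spark-submit command line for a given YARN deploy mode.
# build_submit_cmd is a hypothetical helper; class and jar come from the question.

build_submit_cmd() {
  local mode="$1"   # "client" or "cluster"
  case "$mode" in
    client|cluster) ;;
    *) echo "usage: build_submit_cmd client|cluster" >&2; return 1 ;;
  esac
  # The jar glob is quoted, so it is printed as-is rather than expanded here.
  echo "./bin/spark-submit --class org.apache.spark.examples.SparkPi" \
       "--master yarn --deploy-mode $mode" \
       "lib/spark-examples*.jar 10"
}

# client: the driver runs inside the local spark-submit process (NN1 here)
build_submit_cmd client
# cluster: the driver runs inside the YARN ApplicationMaster on the cluster
build_submit_cmd cluster
```

In client mode the driver output appears in your terminal, which is usually the easier way to start; in cluster mode you have to look at the YARN application logs for it.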

You can use spark-shell with --master yarn to connect to YARN. You need to have the proper Hadoop configuration files, e.g. yarn-site.xml, on the machine you run spark-shell from.
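
Spark finds the cluster through HADOOP_CONF_DIR (or YARN_CONF_DIR), so a quick sanity check before launching can save some guessing. The following is only a sketch: check_hadoop_conf is a made-up helper name, the list of files is my assumption of what a typical client needs, and the directory path is a placeholder for your own installation.

```shell
#!/usr/bin/env bash
# Sketch: verify the client-side Hadoop config before starting spark-shell on YARN.
# check_hadoop_conf is a hypothetical helper; the file list is an assumption.

check_hadoop_conf() {
  local dir="$1" f
  for f in core-site.xml hdfs-site.xml yarn-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $dir/$f" >&2
      return 1
    fi
  done
}

# Usage (not executed here; adjust the placeholder path to your installation):
#   export HADOOP_CONF_DIR=/path/to/hadoop/etc/hadoop
#   check_hadoop_conf "$HADOOP_CONF_DIR" && spark-shell --master yarn
```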

Should I install Spark and Scala on all nodes in the cluster (NN2 and DN1) to run Spark on YARN in client or cluster mode?

No. You don't have to install anything on the YARN nodes, since Spark will distribute the necessary files for you.

If not, how can I submit Spark jobs from the NN1 (primary NameNode) host?

Start with spark-shell --master yarn and see if you can execute the following code:

(0 to 5).toDF.show

If you see a table-like output, you're done. Else, provide the error(s).
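
If it fails with the same "Yarn application has already ended" message, the useful error is normally in the YARN application logs rather than on the console. Here is a minimal sketch for digging it out, assuming you saved the console output to a file and that log aggregation is enabled on the cluster; extract_app_id and submit.log are names I made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: pull the first YARN application ID out of saved spark-submit output.
# extract_app_id is a hypothetical helper name.

extract_app_id() {
  # YARN application IDs look like application_<cluster timestamp>_<sequence>
  grep -o -E 'application_[0-9]+_[0-9]+' "$1" | head -n 1
}

# Usage (not executed here; needs a running cluster):
#   <your spark-submit command> 2>&1 | tee submit.log
#   yarn logs -applicationId "$(extract_app_id submit.log)"
#
# If the ID does not show up in the console output (e.g. logging is at WARN),
# list recent applications, including failed ones, and pick it from there:
#   yarn application -list -appStates ALL
```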

I also created a SPARK_JAR variable in my bashrc file. I tried to submit the Spark job as yarn-client, but I end up with the error below. I have no idea whether I am doing it all correctly or whether other settings need to be done first.

Remove the SPARK_JAR variable. Don't use it, as it's not needed and might cause trouble. Read the official documentation at http://spark.apache.org/docs/latest/running-on-yarn.html to understand the basics of Spark on YARN and beyond.
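
A sketch of the cleanup, assuming the variables were added as plain export lines; strip_deprecated_exports is a made-up helper name. It only prints the cleaned file so you can review it first. I included SPARK_WORKER_INSTANCES because the log in the question warns about it too.

```shell
#!/usr/bin/env bash
# Sketch: print a shell startup file without the deprecated Spark exports.
# strip_deprecated_exports is a hypothetical helper name.

strip_deprecated_exports() {
  # Drop lines that export SPARK_JAR or SPARK_WORKER_INSTANCES; keep the rest.
  sed -E '/^[[:space:]]*export[[:space:]]+(SPARK_JAR|SPARK_WORKER_INSTANCES)=/d' "$1"
}

# Usage (not executed here): review the result, then replace the file by hand.
#   strip_deprecated_exports ~/.bashrc > ~/.bashrc.new
#   unset SPARK_JAR SPARK_WORKER_INSTANCES   # also clear the current shell
#
# If you still want to reuse the assembly JAR already in HDFS, the warning in
# the question points at spark.yarn.jar instead. Assuming the HDFS path from
# the question, that would be a line like this in conf/spark-defaults.conf:
#   spark.yarn.jar  hdfs:///user/spark/share/lib/spark-assembly.jar
```

Setting spark.yarn.jar is optional; without it Spark uploads its own assembly JAR on each submit, which is slower but has fewer ways to go wrong.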


Adding this property to hdfs-site.xml solved the issue.


In client mode you'd run something like the following for a simple word count example:

spark-submit --class org.sparkexample.WordCount --master yarn-client wordcount-sample-plain-1.0-SNAPSHOT.jar input.txt output.txt

I think the spark-submit command may be off there, since it does not use the --master yarn-client form (though, as far as I know, Spark 1.6 also accepts --master yarn --deploy-mode client). I would highly recommend using an automated provisioning tool to set up your cluster quickly instead of a manual approach.

Refer to the Cloudera or Hortonworks tools. You can use them to get set up in no time and submit jobs easily without doing all this configuration manually.

Reference: https://hortonworks.com/products/hdp/