3 votes

I am a complete novice in Spark and have just started exploring it. I have chosen the longer path: instead of installing Hadoop through a CDH distribution, I installed Hadoop from the Apache website and set up the config files myself to understand the basics better.

I have set up a 3-node cluster (all nodes are VMs created on an ESX server) with High Availability for both the NameNode and the ResourceManager using ZooKeeper. All three nodes also act as DataNodes.
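
For context, the HA part of my hdfs-site.xml follows the standard quorum-journal layout, roughly like this (the nameservice name mycluster and the hostnames are placeholders, not my exact values):

    <property><name>dfs.nameservices</name><value>mycluster</value></property>
    <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
    <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>namenode1:8020</value></property>
    <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>namenode2:8020</value></property>
    <property><name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://namenode1:8485;namenode2:8485;datanode1:8485/mycluster</value></property>
    <property><name>dfs.ha.automatic-failover.enabled</name><value>true</value></property>
    <!-- plus ha.zookeeper.quorum in core-site.xml listing the three ZooKeeper hosts -->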

The following daemons are running across the three nodes:

Daemon in NameNode 1          Daemon in NameNode 2         Daemon in DataNode
8724 QuorumPeerMain           22896 QuorumPeerMain         7379 DataNode
13652 Jps                     23780 ResourceManager        7299 JournalNode
9045 DFSZKFailoverController  23220 DataNode               7556 NodeManager
9175 DataNode                 23141 NameNode               7246 QuorumPeerMain
9447 NodeManager              27034 Jps                    9705 Jps
8922 NameNode                 23595 NodeManager
8811 JournalNode              22955 JournalNode
9324 ResourceManager          23055 DFSZKFailoverController

HA for the NameNode and ResourceManager runs on nodes 1 and 2. The nodes have a very minimal hardware configuration (4 GB RAM and 20 GB disk space each), but they are just for testing purposes, so I guess that's OK.

I have installed Spark (a version compatible with my installed Hadoop 2.7) on NameNode 1. I am able to start spark-shell locally, run basic Scala commands to create RDDs, and perform some actions on them. I also managed to run the SparkPi example in both yarn-cluster and yarn-client deploy modes. All works well and good.
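
For reference, this is roughly how I ran SparkPi in the two modes (the examples jar name and location depend on the exact Spark build):

    # cluster mode: the driver runs inside the cluster, output ends up in the YARN logs
    spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn --deploy-mode cluster \
        $SPARK_HOME/lib/spark-examples-*.jar 100

    # client mode: the driver runs on the submitting machine, output prints locally
    spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn --deploy-mode client \
        $SPARK_HOME/lib/spark-examples-*.jar 100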

Now my problem is: in a real-world scenario, we are going to write our code (Java, Scala or Python) on our local machines, not on the nodes that form the Hadoop cluster. Let's say I have another machine on the same network as my HA cluster. How do I submit a job (say, the SparkPi example) from a host that is not part of the HA cluster to the YARN ResourceManager?

I believe Spark has to be installed on the machine where I write my code (is my assumption correct?) and that no Spark needs to be installed inside the HA cluster. I also want to get the output of the submitted job back on the host it was submitted from. I have no idea what needs to be done to make this work.

I have heard of Spark JobServer; is this what I need to get all of this up and running? I believe you guys can help me out with this confusion. I just could not find any document that clearly specifies the steps to follow. Also, can I submit a job from a Windows machine to my HA cluster running in a Unix environment?


2 Answers

0 votes

Spark JobServer provides a REST interface for exactly this requirement. Apart from that, it offers other features beyond plain job submission.

See https://github.com/spark-jobserver/spark-jobserver for more information.
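
The basic flow is: upload your application jar once, then trigger runs over HTTP and read the result from the response. Roughly, going from the project's README (8090 is the default port; the app and class names here are placeholders):

    # upload the application jar under an app name
    curl --data-binary @target/my-spark-job.jar localhost:8090/jars/myapp

    # run a job synchronously; the JSON response contains the job's result
    curl -d "" "localhost:8090/jobs?appName=myapp&classPath=com.example.MyJob&sync=true"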

0 votes

In order to submit Spark jobs to the cluster, your machine has to become a "gateway node" (sometimes called an edge node). That basically means it has the Hadoop binaries, libraries and config files installed, but no Hadoop daemons running on it.
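
Concretely, that usually means installing the same Hadoop version on your machine, copying the cluster's *-site.xml files over, and pointing the client tools at them. A rough sketch (all paths here are examples, adjust to your layout):

    # copy the cluster's client configs from one of the cluster nodes
    scp namenode1:/opt/hadoop/etc/hadoop/core-site.xml /opt/hadoop/etc/hadoop/
    scp namenode1:/opt/hadoop/etc/hadoop/hdfs-site.xml /opt/hadoop/etc/hadoop/
    scp namenode1:/opt/hadoop/etc/hadoop/yarn-site.xml /opt/hadoop/etc/hadoop/

    # make the Hadoop client tools find those configs
    export HADOOP_HOME=/opt/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin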

Once you have that set up, you should be able to run HDFS commands against your cluster from that machine (e.g. hdfs dfs -ls /) and submit YARN applications to it (e.g. yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar pi 3 100; on an Apache install the examples jar lives under $HADOOP_HOME/share/hadoop/mapreduce/ instead).

After that step you can install Spark on your gateway machine and start submitting Spark jobs. If you are going to use Spark on YARN, this is the only machine Spark needs to be installed on.
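
For example, submitting SparkPi from the gateway machine looks something like this (paths depend on your install):

    # HADOOP_CONF_DIR tells Spark where to find the YARN/HDFS configs,
    # including the HA ResourceManager addresses
    export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

    $SPARK_HOME/bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master yarn --deploy-mode cluster \
        $SPARK_HOME/lib/spark-examples-*.jar 100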

You (your code) are responsible for getting the output of the job. You could save the result to HDFS (the most common choice), print it to the console, etc. Spark's History Server is for debugging purposes.
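
For example, in Scala (the names here are just illustrative):

    // assuming an existing SparkContext `sc`
    val result = sc.parallelize(1 to 1000000).map(_ * 2).sum()

    // option 1: persist the result in HDFS so any machine can read it later
    sc.parallelize(Seq(result)).saveAsTextFile("hdfs:///user/me/job-output")

    // option 2: print on the driver -- in yarn-client mode the driver runs
    // on the gateway machine, so this shows up in your local console; in
    // yarn-cluster mode it ends up in the YARN container logs instead
    println(s"result = $result")

Note that yarn-client mode is what gets driver output straight back to the host you submitted from, which covers the "get the output back" part of your question.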