I am a complete novice in Spark and have just started exploring it. I chose the longer path by not installing Hadoop via any CDH distribution; instead, I installed Hadoop from the Apache website and set up the config files myself to better understand the basics.
I have set up a 3-node cluster (all nodes are VMs created on an ESX server). I have set up High Availability for both the NameNode and the ResourceManager using the ZooKeeper mechanism. All three nodes are also used as DataNodes.
The following daemons are running across the three nodes (jps output):

NameNode 1:
    8724 QuorumPeerMain
    13652 Jps
    9045 DFSZKFailoverController
    9175 DataNode
    9447 NodeManager
    8922 NameNode
    8811 JournalNode
    9324 ResourceManager

NameNode 2:
    22896 QuorumPeerMain
    23780 ResourceManager
    23220 DataNode
    23141 NameNode
    27034 Jps
    23595 NodeManager
    22955 JournalNode
    23055 DFSZKFailoverController

DataNode:
    7379 DataNode
    7299 JournalNode
    7556 NodeManager
    7246 QuorumPeerMain
    9705 Jps
I have set up HA for the NN and RM on NameNodes 1 and 2. The nodes have a very minimal hardware configuration (4GB RAM and 20GB disk space each), but they are just for testing purposes, so I guess that is okay.
I have installed Spark (a version compatible with my installed Hadoop 2.7) on NameNode 1. I am able to start spark-shell locally and run basic Scala commands to create an RDD and perform some actions on it. I also managed to test-run the SparkPi example in both yarn-cluster and yarn-client deploy modes. All works well.
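For reference, the command I used for the cluster-mode test was roughly the following (the exact path to the examples jar depends on the Spark version, so treat that part as a placeholder):

    spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn --deploy-mode cluster \
        $SPARK_HOME/lib/spark-examples-*.jar 10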
Now to my actual problem: in a real-world scenario, we are going to write (Java-, Scala-, or Python-based) code on our local machines, not on the nodes that form the Hadoop cluster. Let's say I have another machine on the same network as my HA cluster. How do I submit a job (say, the SparkPi example) from a host that is not part of the HA cluster to the YARN ResourceManager?
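My guess is that, from the remote host, something like the following should work once I copy the cluster's Hadoop config files (core-site.xml, hdfs-site.xml, yarn-site.xml) over to it, but I am not sure whether anything else is needed:

    # On the remote machine (my guess, untested):
    # point Spark at local copies of the cluster's config files
    export HADOOP_CONF_DIR=/path/to/copied/hadoop/conf
    export YARN_CONF_DIR=$HADOOP_CONF_DIR
    spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn --deploy-mode cluster \
        $SPARK_HOME/lib/spark-examples-*.jar 10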
I believe Spark has to be installed on the machine from which I am writing and submitting my code (is my assumption correct?), and that no Spark needs to be installed on the HA cluster itself. I also want to get the output of the submitted job back on the host from which it was submitted. I have no idea what needs to be done to make this work.
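If I understand correctly, in yarn-client mode the driver runs on the submitting machine, so the job's console output would appear locally, whereas in yarn-cluster mode I assume I would have to pull the logs afterwards. Something like:

    # yarn-client mode: driver (and its console output) stays on my machine
    spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn --deploy-mode client \
        $SPARK_HOME/lib/spark-examples-*.jar 10

    # yarn-cluster mode: fetch the aggregated logs after the job finishes
    yarn logs -applicationId <application_id>

Is that the right way to think about it?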
I have heard of Spark JobServer; is this what I need to get all of this up and running? I believe you guys can help me out of this confusion. I just could not find any document that clearly specifies the steps to follow to get this done. One more question: can I submit a job from a Windows-based machine to my HA cluster running in a Unix environment?