At the time of writing, there seems to be no official way of achieving this. However, I have been able to run Spark jobs remotely using an Oozie shell workflow. It is only a workaround, but so far it has worked for me. These are the steps I followed:
Prerequisites
- Microsoft PowerShell
- Azure PowerShell
Process
Define an Oozie workflow *.xml* file:
<workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">
  <start to="myAction"/>
  <action name="myAction">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
      </configuration>
      <exec>myScript.cmd</exec>
      <file>wasb://[email protected]/myScript.cmd#myScript.cmd</file>
      <file>wasb://[email protected]/mySpark.jar#mySpark.jar</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
Note that it is not possible to know in advance which HDInsight node will execute the script, so the script and the Spark application .jar must both be placed in wasb storage. The `<file>` elements then copy them into the local working directory in which the Oozie job runs.
Define the custom script
rem Submit mySpark.jar to YARN; each trailing ^ continues the command on the next line
C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass ^
  --master yarn-cluster ^
  --deploy-mode cluster ^
  --num-executors 3 ^
  --executor-memory 2g ^
  --executor-cores 4 ^
  mySpark.jar
Both the .cmd file and the Spark .jar must be uploaded to wasb storage (a step not covered in this answer), specifically to the location referenced in the workflow:
wasb://[email protected]/
Define the PowerShell script
The PowerShell script is taken almost verbatim from the official Oozie on HDInsight tutorial. I am not including it in this answer because it is nearly identical to the one in the tutorial.
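For orientation, the workflow leaves `${jobTracker}`, `${nameNode}` and `${queueName}` to be supplied when the job is submitted. Below is a minimal sketch of the kind of job configuration a submission script could send to Oozie. Every value is a placeholder assumption rather than something taken from the setup described here: `CONTAINER`, `ACCOUNT`, the job tracker address, the workflow path and the user name are all hypothetical and need to be replaced with the values for your own cluster.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of an Oozie job configuration; all values are hypothetical placeholders -->
<configuration>
  <!-- Fills ${nameNode} in the workflow (assumed default storage container) -->
  <property>
    <name>nameNode</name>
    <value>wasb://CONTAINER@ACCOUNT.blob.core.windows.net</value>
  </property>
  <!-- Fills ${jobTracker}; the host and port are an assumption, check your cluster -->
  <property>
    <name>jobTracker</name>
    <value>jobtrackerhost:9010</value>
  </property>
  <!-- Fills ${queueName} -->
  <property>
    <name>queueName</name>
    <value>default</value>
  </property>
  <!-- Where Oozie should look for the workflow definition (hypothetical path) -->
  <property>
    <name>oozie.wf.application.path</name>
    <value>wasb://CONTAINER@ACCOUNT.blob.core.windows.net/path/to/workflow.xml</value>
  </property>
  <!-- User the job is submitted as (hypothetical) -->
  <property>
    <name>user.name</name>
    <value>admin</value>
  </property>
</configuration>
```

The property names `oozie.wf.application.path` and `user.name` are standard Oozie job properties. The other three names only need to match the `${...}` variables used in the workflow.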
I have submitted a suggestion on the Azure feedback portal requesting official support for remote Spark job submission.