I followed this tutorial on submitting mapreduce jobs to HDInsight from a .NET console app.

It works fine, but I'm wondering about this part:

var jobDefinition = new MapReduceJobCreateParameters()
{
    JarFile = "wasb:///example/jars/hadoop-examples.jar",
    ClassName = "wordcount"
};

"wasb:///example/jars/hadoop-examples.jar" refers to a jar in my Azure storage account that was automatically put there when I connected the account to my new HDInsight cluster.

Moving beyond the examples (I want to use Mahout)... can I reference a jar that I have added to the cluster node itself? I installed Mahout into the apps/dist directory via RDP. I can run Mahout jobs from there just fine, but I can't put these two steps together.

It feels like I shouldn't have to add jar files to blob storage to use them.


1 Answer


HDInsight uses WASB (Windows Azure Storage - Blob), an HDFS implementation backed by Windows Azure Blob storage. If "hadoop fs -ls" on the cluster can list the jar file, the file is already on WASB, and you can reference it using the WASB syntax. For more information, see http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-blob-storage/.
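As a sketch of the WASB syntax: a fully qualified WASB URI names the container and storage account explicitly, while the short form wasb:/// resolves against the cluster's default container. The container and account names below are hypothetical placeholders; substitute your own.

```shell
# Hypothetical names - replace with your container and storage account.
CONTAINER="mycontainer"
ACCOUNT="myaccount"
JAR_PATH="example/jars/hadoop-examples.jar"

# Fully qualified form: wasb://<container>@<account>.blob.core.windows.net/<path>
URI="wasb://${CONTAINER}@${ACCOUNT}.blob.core.windows.net/${JAR_PATH}"
echo "$URI"

# Short form (default container of the cluster): wasb:///<path>
SHORT_URI="wasb:///${JAR_PATH}"
echo "$SHORT_URI"

# On the cluster head node you could then verify the file exists with:
#   hadoop fs -ls "$URI"
```

Either form works in the JarFile property of MapReduceJobCreateParameters, as the question's wordcount example shows with the short form.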

There are some restrictions on customizing an HDInsight cluster. Two customization methods are supported: using a configuration file during the provisioning process, and running native Java components packaged as jar files on the cluster. Installing applications via RDP is not supported. Mahout falls under the second supported case. If the Mahout jar file is not already on WASB, you can upload it using "hadoop fs -copyFromLocal" or using Windows Azure PowerShell. For a list of upload methods, see http://www.windowsazure.com/en-us/documentation/articles/hdinsight-upload-data/.
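A minimal sketch of that upload step, to be run on the cluster head node over your existing RDP session. The local path and jar file name below are hypothetical; use the actual Mahout job jar from your apps/dist installation.

```shell
# Hypothetical paths - substitute the real location of your Mahout job jar.
LOCAL_JAR="C:/apps/dist/mahout/mahout-examples-job.jar"
DEST="wasb:///example/jars/mahout-examples-job.jar"

# The actual upload (requires the cluster's hadoop CLI, so shown here
# as the command you would run rather than executed directly):
#   hadoop fs -copyFromLocal "$LOCAL_JAR" "$DEST"
# Then confirm it landed on WASB:
#   hadoop fs -ls wasb:///example/jars/
echo "hadoop fs -copyFromLocal $LOCAL_JAR $DEST"
```

Once the jar is on WASB, you can point JarFile in MapReduceJobCreateParameters at the wasb:/// path, exactly as the question does for hadoop-examples.jar.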