When trying to submit a Hadoop MapReduce job programmatically (from a Java application using the dataproc library), the job fails immediately. When submitting that exact same job through the UI, it works fine.
I've tried SSHing onto the Dataproc cluster to confirm the file exists, to check permissions, and changed the jar reference. Nothing has worked yet.
The error I'm getting:
Exception in thread "main" java.lang.ClassNotFoundException: file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:18)
Job output is complete
When I clone the failed job in the console and look at the REST equivalent this is what I see:
POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
"projectId": "project-id",
"job": {
"reference": {
"projectId": "project-id",
"jobId": "jobDoesNotWork"
},
"placement": {
"clusterName": "cluster-name",
"clusterUuid": "uuid"
},
"submittedBy": "[email protected]",
"jobUuid": "uuid",
"hadoopJob": {
"args": [
"-Dmapred.reduce.tasks=20",
"-Dmapred.output.compress=true",
"-Dmapred.compress.map.output=true",
"-Dstream.map.output.field.separator=,",
"-Dmapred.textoutputformat.separator=,",
"-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
"-Dmapreduce.input.fileinputformat.split.minsize=268435456",
"-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
"-mapper",
"/bin/cat",
"-reducer",
"/bin/cat",
"-inputformat",
"org.apache.hadoop.mapred.lib.CombineTextInputFormat",
"-outputformat",
"org.apache.hadoop.mapred.TextOutputFormat",
"-input",
"gs://input/path/",
"-output",
"gs://output/path/"
],
"mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
}
}
}
When I submit the job through the console it works. The REST equivalent of that job:
POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
"projectId": "project-id",
"job": {
"reference": {
"projectId": "project-id,
"jobId": "jobDoesWork"
},
"placement": {
"clusterName": "cluster-name,
"clusterUuid": ""
},
"submittedBy": "[email protected]",
"jobUuid": "uuid",
"hadoopJob": {
"args": [
"-Dmapred.reduce.tasks=20",
"-Dmapred.output.compress=true",
"-Dmapred.compress.map.output=true",
"-Dstream.map.output.field.separator=,",
"-Dmapred.textoutputformat.separator=,",
"-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
"-Dmapreduce.input.fileinputformat.split.minsize=268435456",
"-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
"-mapper",
"/bin/cat",
"-reducer",
"/bin/cat",
"-inputformat",
"org.apache.hadoop.mapred.lib.CombineTextInputFormat",
"-outputformat",
"org.apache.hadoop.mapred.TextOutputFormat",
"-input",
"gs://input/path/",
"-output",
"gs://output/path/"
],
"mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
}
}
}
I ssh’ed into the box and confirmed that the file is, in fact, present. The only difference I can really see is the “submittedBy”. One works, one doesn't. I'm guessing this is a permission thing, but I cannot seem to tell where the permissions are being pulled from in each scenario. In both cases, the Dataproc cluster is created with the same service account.
Looking at permissions for that jar on the cluster I see:
-rw-r--r-- 1 root root 133856 Nov 27 20:17 hadoop-streaming-2.8.4.jar
lrwxrwxrwx 1 root root 26 Nov 27 20:17 hadoop-streaming.jar -> hadoop-streaming-2.8.4.jar
I tried changing the mainJarFileUri from pointing explicitly to the versioned jar to the link (since it had open permissions), but didn't really expect it to work. And it didn't.
Does anyone with more Dataproc experience have any idea what's going on here, and how I can resolve it?
gcloud dataproc jobs describe <jobid>of the failed job attempt and if possible, the snippet of code you use to programmatically construct the job settings? - Dennis Huo