EMR spark job from aws-cli

Question

I am trying to run an spark job on EMR using the aws cli.

What I want is to have the server startup, run the job, and terminate.

I am able to do it as a two step process (first fire up the server, then run the job), but when I send one command I get an error.

error:

Error: Cannot load main class from JAR file:/home/hadoop/spark/bin/spark-submit
Run with --help for usage help or --verbose for debug output
Command exiting with ret '1'

So it looks like it can't find the jar (or the main class). I have set the master to yarn-cluster so that it should lookup the jars on s3, and I am 100% sure that the classpath of main class is correct.

Command

aws emr create-cluster --name "Test auto run" --release-label emr-5.4.0 \
--applications Name=Spark --ec2-attributes KeyName=key-emr --instance-type m3.xlarge --instance-count 2 \
--log-uri s3://test/emr --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,,\
Args=[/home/hadoop/spark/bin/spark-submit,--verbose,--master,yarn-cluster,--class,co.path.test.TestJob,s3://test/test-0.0.1-SNAPSHOT-jar-with-dependencies.jar,\
's3://test/test-messages/1998*','d','s3://test/loaded'] \
--use-default-roles --auto-terminate

The controller says this is being executed:

/usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit 
/home/hadoop/spark/bin/spark-submit --verbose --master yarn-cluster --class,co.path.test.TestJob s3://test/test-0.0.1-SNAPSHOT-jar-with-dependencies.jar s3://test/test-messages/1998* d s3://test/loaded

Any ideas what am I messing up?

jc mannem jc mannem · Accepted Answer · 2017-05-03T20:04:48

If the EMR step type is Spark which you mentioned on the step API --steps Type=Spark , as you identified on Step's controller logs, EMR will add the spark-submit command and you do not need to pass the /home/hadoop/spark/bin/spark-submit as arguments of the STEP API.

The error was because of two spark-submit's , where it was taking the second one /home/hadoop/spark/bin/spark-submit as argument.

Please see : http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html

EMR spark job from aws-cli

1 Answers