I am trying to run a Spark job on EMR using the AWS CLI.
What I want is to have the cluster start up, run the job, and terminate.
I am able to do it as a two-step process (first start the cluster, then submit the job), but when I send it all as one command I get an error.
Error:
Error: Cannot load main class from JAR file:/home/hadoop/spark/bin/spark-submit
Run with --help for usage help or --verbose for debug output
Command exiting with ret '1'
So it looks like it can't find the JAR (or the main class). I have set the master to yarn-cluster so that it should look up the JAR on S3, and I am 100% sure that the fully qualified name of the main class is correct.
Command:
aws emr create-cluster --name "Test auto run" --release-label emr-5.4.0 \
--applications Name=Spark --ec2-attributes KeyName=key-emr --instance-type m3.xlarge --instance-count 2 \
--log-uri s3://test/emr --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,\
Args=[/home/hadoop/spark/bin/spark-submit,--verbose,--master,yarn-cluster,--class,co.path.test.TestJob,s3://test/test-0.0.1-SNAPSHOT-jar-with-dependencies.jar,\
's3://test/test-messages/1998*','d','s3://test/loaded'] \
--use-default-roles --auto-terminate
The controller says this is being executed:
/usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit
/home/hadoop/spark/bin/spark-submit --verbose --master yarn-cluster --class,co.path.test.TestJob s3://test/test-0.0.1-SNAPSHOT-jar-with-dependencies.jar s3://test/test-messages/1998* d s3://test/loaded
Any ideas what I am messing up?
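For reference, the two-step version that works looks roughly like this (the cluster id is a placeholder taken from the create-cluster output, and the step arguments here are illustrative, matching the single-command attempt above):

```shell
# Step 1: start the cluster and keep it alive (no --auto-terminate)
aws emr create-cluster --name "Test auto run" --release-label emr-5.4.0 \
  --applications Name=Spark --ec2-attributes KeyName=key-emr \
  --instance-type m3.xlarge --instance-count 2 \
  --log-uri s3://test/emr --use-default-roles
# note the cluster id (e.g. j-XXXXXXXXXXXXX) printed by the command above

# Step 2: submit the job as a step to the running cluster
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,\
Args=[--verbose,--master,yarn-cluster,--class,co.path.test.TestJob,s3://test/test-0.0.1-SNAPSHOT-jar-with-dependencies.jar,\
's3://test/test-messages/1998*',d,s3://test/loaded]
```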