
I am trying to create an AWS Data Pipeline task that creates an EMR cluster and runs a simple wordcount.py Spark program. In my pipeline definition the step is simply:

"myEmrStep": "s3://test/wordcount.py,s3://test/data/abc.txt,s3://test/output/outfile5/",

Now, when I activate the task, I get an error like:

Exception in thread "main" java.io.IOException: Error opening job jar: /mnt/var/lib/hadoop/steps/s-187JR8H3XT8N7/wordcount.py
    at org.apache.hadoop.util.RunJar.run(RunJar.java:160)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.<init>(ZipFile.java:215)
    at ...

It looks like the step is trying to run the program using Java instead of Python. Any ideas, please?

Thanks.


2 Answers


You need to tell the cluster that you want to run a Spark command. You can find samples (including a Spark pipeline) on GitHub: https://github.com/awslabs/data-pipeline-samples

EMR clusters support the 'spark-submit' format to start your command; please check this for details: http://spark.apache.org/docs/latest/submitting-applications.html
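
For this question, that means the step should go through command-runner.jar and spark-submit instead of pointing at the .py file directly. A sketch (assuming an EMR release label of 4.x or later so that command-runner.jar is available, and cluster deploy mode so the script can be pulled from S3):

"myEmrStep": "command-runner.jar,spark-submit,--deploy-mode,cluster,s3://test/wordcount.py,s3://test/data/abc.txt,s3://test/output/outfile5/",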

Best.


In my pipeline definition I'm using the following: basically a jar (command-runner.jar) that allows you to run an arbitrary command on the box, and then I submit the Spark job with the spark-submit command (which you can also use locally).

Make sure that all the paths that you use in the command are absolute; otherwise it might not work (I don't know which folder is the current directory).

This is on the Activity node in the pipeline: "step" : "/var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar,spark-submit,--py-files,/home/hadoop/lib.zip,/home/hadoop/analyse_exposures.py,#{myOutputS3Loc}/output/,#{myExposuresAnalysisWindowSize}"
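
command-runner.jar just splits that comma-separated string and executes it as a command on the master node, so once Data Pipeline has substituted the two #{...} parameters the step amounts to roughly:

spark-submit --py-files /home/hadoop/lib.zip /home/hadoop/analyse_exposures.py <myOutputS3Loc>/output/ <myExposuresAnalysisWindowSize>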

Note also that I have a script that bootstraps the cluster and copies all the code over to the individual machines in the cluster so that it exists locally.

This is defined on the EMR-resource: "bootstrapAction": "#{myDeliverablesBucket}/emr/bootstrap.sh,#{myDeliverablesBucket}/emr/"

I know that copying over all the resources as the cluster starts, instead of reading them from S3 directly, might not be the most flexible approach, but it does the job.