
I am trying to create an AWS Data Pipeline task that creates an EMR cluster and runs a simple wordcount.py Spark program. In my pipeline definition the step is simply:

"myEmrStep": "s3://test/wordcount.py,s3://test/data/abc.txt,s3://test/output/outfile5/",

Now, when I activate the task, I get an error like:

Exception in thread "main" java.io.IOException: Error opening job jar: /mnt/var/lib/hadoop/steps/s-187JR8H3XT8N7/wordcount.py
    at org.apache.hadoop.util.RunJar.run(RunJar.java:160)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.util.zip.ZipException: error in opening zip file
    at java.util.zip.ZipFile.open(Native Method)
    at java.util.zip.ZipFile.<init>(ZipFile.java:215)
    at ...

It looks like the step is trying to run the program using Java instead of Python. Any ideas, please?

Thanks.


2 Answers


You need to tell the cluster that you want to run a Spark command. You can find samples (including a Spark pipeline) on GitHub: https://github.com/awslabs/data-pipeline-samples

EMR clusters support the 'spark-submit' format to start your command; please check this for details: http://spark.apache.org/docs/latest/submitting-applications.html
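
For this question, that means the step should go through command-runner.jar and spark-submit instead of pointing at the .py file directly. A sketch (assuming an EMR release label of 4.x or later so that command-runner.jar is available, and cluster deploy mode so the script can be pulled from S3):

"myEmrStep": "command-runner.jar,spark-submit,--deploy-mode,cluster,s3://test/wordcount.py,s3://test/data/abc.txt,s3://test/output/outfile5/",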

Best.


In my pipeline definition I'm using the following: basically a jar (command-runner.jar) that allows you to run an arbitrary command on the box, and then I submit the Spark job with the spark-submit command (which you can also use locally).

Make sure that all the paths that you use in the command are absolute; otherwise it might not work (I don't know which folder is the current directory).

This is on the Activity node in the pipeline: "step" : "/var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar,spark-submit,--py-files,/home/hadoop/lib.zip,/home/hadoop/analyse_exposures.py,#{myOutputS3Loc}/output/,#{myExposuresAnalysisWindowSize}"
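
command-runner.jar just splits that comma-separated string and executes it as a command on the master node, so once Data Pipeline has substituted the two #{...} parameters the step amounts to roughly:

spark-submit --py-files /home/hadoop/lib.zip /home/hadoop/analyse_exposures.py <myOutputS3Loc>/output/ <myExposuresAnalysisWindowSize>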

Note also that I have a script that bootstraps the cluster and copies all the code over to the individual machines in the cluster so that it exists locally.

This is defined on the EMR-resource: "bootstrapAction": "#{myDeliverablesBucket}/emr/bootstrap.sh,#{myDeliverablesBucket}/emr/"

I know that copying over all the resources as the cluster starts, instead of reading them from S3 directly, might not be the most flexible approach, but it does the job.