4 votes

I am running an AWS EMR cluster using YARN as the master with cluster deploy mode. All of the tutorials I have read run spark-submit through the AWS CLI in so-called "Spark Steps", using a command similar to the following:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/lib/spark-examples.jar,10]

My professor recommends that I submit my Spark application by moving the files to the master node via SCP and then running the application via SSH:

ssh [email protected]

I would then put the data files into HDFS via the shell, and finally run spark-submit:

spark-submit --master yarn --deploy-mode cluster my_spark_app.py my_hdfs_file.csv
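Put together, the whole workflow would look roughly like this (the host name, key file, and HDFS paths below are just placeholders, not my actual values):

# Copy the application and data file to the master node
scp -i my-key.pem my_spark_app.py my_hdfs_file.csv hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~

# Log in to the master node
ssh -i my-key.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# On the master node: load the data into HDFS, then submit to YARN
hdfs dfs -put my_hdfs_file.csv /user/hadoop/my_hdfs_file.csv
spark-submit --master yarn --deploy-mode cluster my_spark_app.py hdfs:///user/hadoop/my_hdfs_file.csv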

What is the difference between submitting a "Spark Step" through the AWS CLI and running spark-submit over SSH on the master node? Will my Spark application still run in a distributed fashion if I submit the job from the master node?


1 Answer

5 votes

Submitting an EMR step uses Amazon's custom-built step submission process, which is a relatively light wrapper abstraction that itself calls spark-submit. Fundamentally, there is little difference: either way the application still runs distributed across the YARN cluster, since the master node is only the point of submission. But if you wish to stay platform agnostic (i.e., not locked in to Amazon), use the SSH strategy, or try even more advanced submission strategies such as remote submission or, one of my favorites, Livy.
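As a rough sketch of the Livy option (assuming Livy is running on the master node on its default port 8998 and that your application and data already live in HDFS; the host name and paths are placeholders), you submit a batch over HTTP instead of calling spark-submit yourself:

# Submit the application as a Livy batch; Livy hands it off to YARN
curl -X POST http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{"file": "hdfs:///user/hadoop/my_spark_app.py", "args": ["hdfs:///user/hadoop/my_hdfs_file.csv"]}'

# Poll the batch list to check the job's state
curl http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8998/batches

Because submission happens over a REST API, this also works from outside the cluster, which is what makes it attractive compared with SSHing into the master node.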