I am running an AWS EMR cluster using YARN as the master and cluster deploy mode. All of the tutorials I have read run spark-submit using the AWS CLI in so-called "Spark Steps", with a command similar to the following:
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=Spark,Name="Spark Program",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/lib/spark-examples.jar,10]
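If I understand correctly, the equivalent step for a Python application like mine would just pass the script and its arguments straight through to spark-submit, along these lines (the bucket and file names below are placeholders, not my real setup):

# hypothetical step for a PySpark script stored in S3
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=Spark,Name="My PySpark Program",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/my_spark_app.py,s3://my-bucket/my_hdfs_file.csv]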
My professor recommends that I submit my Spark applications by copying the files to the master node via SCP and then running the application over SSH:
ssh hadoop@<master-node-public-DNS>
Then I would put the data files into HDFS via the shell, and finally I would simply run spark-submit:
spark-submit --master yarn --deploy-mode cluster my_spark_app.py my_hdfs_file.csv
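For reference, the full sequence my professor has in mind would look roughly like this (the host name and paths are placeholders for my actual values):

# copy the application and the data file to the master node
scp my_spark_app.py my_hdfs_file.csv hadoop@<master-node-public-DNS>:~/

# on the master node: load the data into HDFS
ssh hadoop@<master-node-public-DNS>
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put my_hdfs_file.csv /user/hadoop/

# submit the application to YARN in cluster deploy mode
spark-submit --master yarn --deploy-mode cluster my_spark_app.py /user/hadoop/my_hdfs_file.csv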
What is the difference between submitting a "Spark Step" through the AWS CLI and running spark-submit after SSHing into the master node? Will my Spark application still run in a distributed fashion if I submit the job from the master node?