I'm about to try EMR and am currently going through the documentation. I'm a bit confused by the submit process.
1) Where are the Spark libraries?
From the Spark documentation we find:
- spark.yarn.jars: List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
- (a) I wonder how this is handled with EMR, i.e. is it set up by EMR, or do I have to set it up myself (something like the snippet below)?
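If it is not preconfigured, I assume I would have to set it myself, roughly like this (the HDFS path is only a placeholder I made up, not something from the docs):

```
# Hypothetical: stage the Spark jars on HDFS and point spark.yarn.jars at them,
# either in spark-defaults.conf or directly on the command line.
spark-submit --conf spark.yarn.jars="hdfs:///user/spark/jars/*.jar" ...
```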
2) How does the --master parameter work?
From the Spark documentation:
- --master: Unlike other cluster managers supported by Spark in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.
- (a) Is that set up by EMR directly, i.e. can I just run something like the command below?
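Just to make the question concrete, this is the kind of invocation I have in mind (the jar path and class name are placeholders):

```
# Hypothetical example: does this work out of the box on the EMR master node,
# with the ResourceManager's address picked up from the Hadoop configuration?
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /home/hadoop/my-app.jar
```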
3) Is there a way to submit the application from my terminal, or is the only way to deploy the jar on S3?
- (a) Can I log on to the master node and run the submit from there?
- (b) Will all the environment variables necessary for the submit script to work already be set (see the previous questions)?
- (c) What is the most productive way to do this submit? The two workflows I imagine are sketched below.
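For context, these are the two options I have in mind (cluster id, key file, bucket, jar, and class names are all placeholders I made up):

```
# Option A: log on to the master node and submit from there
ssh -i my-key.pem hadoop@<master-public-dns>
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp /home/hadoop/my-app.jar

# Option B: submit from my local terminal as an EMR step, with the jar uploaded to S3
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="MyApp",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--class,com.example.MyApp,s3://my-bucket/jars/my-app.jar]
```

Is one of these the recommended approach, or is there a better way?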