I'm about to try EMR and am currently going through the documentation. I'm a bit confused by the submit process.
1) Where are the Spark libraries?
From the Spark documentation we find:
- spark.yarn.jars: List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
- (a) I wonder how this is handled with EMR, i.e. is it set up by EMR, or do I have to set it up myself (something like the snippet below)?
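If it is not preconfigured, I assume I would have to set it myself, roughly like this (the HDFS path is only a placeholder I made up, not something from the docs):

```
# Hypothetical: stage the Spark jars on HDFS and point spark.yarn.jars at them,
# either in spark-defaults.conf or directly on the command line.
spark-submit --conf spark.yarn.jars="hdfs:///user/spark/jars/*.jar" ...
```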
2) How does the --master parameter work?
From the Spark documentation:
- --master: Unlike other cluster managers supported by Spark in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.
- (a) Is that set up by EMR directly, i.e. can I just run something like the command below?
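Just to make the question concrete, this is the kind of invocation I have in mind (the jar path and class name are placeholders):

```
# Hypothetical example: does this work out of the box on the EMR master node,
# with the ResourceManager's address picked up from the Hadoop configuration?
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /home/hadoop/my-app.jar
```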
3) Is there a way to submit the application from my terminal, or is the only way to deploy the jar on S3?
- (a) Can I log on to the master node and run the submit from there?
- (b) Will all the environment variables necessary for the submit script to work already be set (see the previous questions)?
- (c) What is the most productive way to do this submit? The two workflows I imagine are sketched below.
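For context, these are the two options I have in mind (cluster id, key file, bucket, jar, and class names are all placeholders I made up):

```
# Option A: log on to the master node and submit from there
ssh -i my-key.pem hadoop@<master-public-dns>
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp /home/hadoop/my-app.jar

# Option B: submit from my local terminal as an EMR step, with the jar uploaded to S3
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="MyApp",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--class,com.example.MyApp,s3://my-bucket/jars/my-app.jar]
```

Is one of these the recommended approach, or is there a better way?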