
I am trying to launch a Spark job on a Hadoop cluster using spark-submit in YARN mode.

I am launching spark-submit from my development machine.

According to the Running Spark on YARN docs, I am supposed to provide the path to the Hadoop cluster configuration in the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR. This is where it gets tricky: why must these folders exist on my local machine if I am sending the task to a remote YARN service? Does this mean that spark-submit must be located inside the cluster, and that I therefore cannot launch a Spark task remotely? If not, what should I populate these folders with? Should I copy the Hadoop configuration folder from the YARN cluster node where the ResourceManager service resides?

1 Answer


1) When submitting a job, Spark needs to know what it is connecting to. The configuration files are parsed, and the required settings are used to connect to the Hadoop cluster. Note that the documentation calls this the client-side configuration (right in the first sentence), meaning you do not actually need the full cluster configuration in those files. To connect to a non-secured Hadoop cluster with a minimal configuration, you will need at least the following settings present:

  • fs.defaultFS (in case you intend to read from HDFS)
  • dfs.nameservices
  • yarn.resourcemanager.hostname or yarn.resourcemanager.address
  • yarn.application.classpath
  • (others might be required, depending on the configuration)

You can avoid having the files altogether by setting the same settings in the code of the job you are submitting:

import org.apache.spark.SparkConf;

SparkConf sparkConfiguration = new SparkConf();
// Properties prefixed with "spark.hadoop." are forwarded to the Hadoop configuration
sparkConfiguration.set("spark.hadoop.fs.defaultFS", "...");
...

2) spark-submit can be located on any machine, not necessarily on the cluster, as long as it knows how to connect to the cluster (you can even run the submission from Eclipse, without installing anything except the project dependencies related to Spark).
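For illustration, here is a minimal sketch of such a programmatic submission; the class name, hostnames, and ports are hypothetical placeholders, and it assumes Spark's YARN support (the spark-yarn module) is on the classpath:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteSubmitSketch {
    public static void main(String[] args) {
        // All hostnames and ports below are hypothetical placeholders.
        SparkConf conf = new SparkConf()
                .setAppName("remote-submit-sketch")
                .setMaster("yarn-client") // "yarn" with client deploy mode on newer Spark versions
                // "spark.hadoop."-prefixed properties are copied into the Hadoop configuration
                .set("spark.hadoop.fs.defaultFS", "hdfs://namenode-host:8020")
                .set("spark.hadoop.yarn.resourcemanager.hostname", "resourcemanager-host")
                .set("spark.hadoop.yarn.resourcemanager.address", "resourcemanager-host:8032");

        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        System.out.println("Default parallelism: " + sparkContext.defaultParallelism());
        sparkContext.stop();
    }
}

Whether this works without any local XML files depends on your Spark version and cluster setup; a secured (Kerberos) cluster needs considerably more configuration.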

3) You should populate the configuration folders with:

  • core-site.xml
  • yarn-site.xml
  • hdfs-site.xml
  • mapred-site.xml

Copying those files from the server is the easiest approach to start with. Afterwards you can remove any configuration that is not required by spark-submit or that may be security-sensitive.
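As a rough command-line sketch of that workflow (all hostnames and paths are hypothetical placeholders):

# Copy the client-side configuration files from a cluster node
mkdir -p ~/hadoop-conf
scp 'user@cluster-node:/etc/hadoop/conf/*-site.xml' ~/hadoop-conf/

# Point spark-submit at the copied configuration and submit in YARN mode
export HADOOP_CONF_DIR=~/hadoop-conf
spark-submit --master yarn --class com.example.MyJob my-job.jar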