0
votes

I am trying to understand the "concept" of connecting to a remote server. What I have are 4 servers on CentOS running CDH5.4. What I want to do is run Spark on YARN across all four nodes. My problem is that I do not understand how to set HADOOP_CONF_DIR as specified here. Where, and to what value, should I set this variable? And do I need to set this variable on all four nodes, or will the master node alone suffice?

The documentation says "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". I have read many similar questions before asking this one here. Please let me know what I can do to solve this problem. I am able to run spark and pyspark in standalone mode on all nodes.

Thanks for your help. Ashish


1 Answer

0
votes

Where, and to what value, should I set this variable?

The variable HADOOP_CONF_DIR should point to the directory that contains yarn-site.xml. Usually you set it in ~/.bashrc. I found the relevant documentation for CDH: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html
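For example, a minimal sketch of what you might add to ~/.bashrc. Note that /etc/hadoop/conf is the usual client-configuration location on CDH, but check where yarn-site.xml actually lives on your nodes, since your installation may differ:

```shell
# In ~/.bashrc on each node.
# /etc/hadoop/conf is the typical CDH client config dir -- adjust if
# your yarn-site.xml lives somewhere else.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Sanity check: the directory should contain yarn-site.xml.
if [ -f "$HADOOP_CONF_DIR/yarn-site.xml" ]; then
  echo "HADOOP_CONF_DIR looks good: $HADOOP_CONF_DIR"
else
  echo "yarn-site.xml not found under $HADOOP_CONF_DIR" >&2
fi
```

After editing ~/.bashrc, run `source ~/.bashrc` (or open a new shell) so the variable takes effect.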

Basically, every node needs the configuration files in the directory the environment variable points to. As the linked documentation puts it: "Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines."
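Once HADOOP_CONF_DIR is set on the node you submit from, you can point Spark at YARN. A hedged sketch, assuming a Spark 1.3-era CDH install where spark-submit is on the PATH and `yarn-cluster` / `yarn-client` are the accepted master values (SparkPi and the examples jar path are the stock Spark example shipped with CDH; adjust the version in the jar name to match yours):

```shell
# Submit the bundled SparkPi example to YARN in cluster mode.
# spark-submit reads $HADOOP_CONF_DIR to locate the cluster.
spark-submit \
  --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/lib/spark-examples.jar 10

# For an interactive pyspark shell against YARN, use client mode:
pyspark --master yarn-client
```

If the submit fails with errors about connecting to `0.0.0.0:8032`, that usually means HADOOP_CONF_DIR is unset or pointing at a directory without a valid yarn-site.xml.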