0
votes

I am trying to understand the "concept" of connecting to a remote server. What I have are 4 servers on CentOS running CDH5.4. What I want to do is run Spark on YARN across all four nodes. My problem is that I do not understand how to set HADOOP_CONF_DIR as specified here. Where, and to what value, should I set this variable? And do I need to set this variable on all four nodes, or will the master node alone suffice?

The documentation says "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster". I have read many similar questions before asking this one here. Please let me know what I can do to solve this problem. I am able to run spark and pyspark in standalone mode on all nodes.

Thanks for your help. Ashish


1 Answer

0
votes

Where, and to what value, should I set this variable?

The variable HADOOP_CONF_DIR should point to the directory that contains yarn-site.xml. Usually you set it in ~/.bashrc. I found the relevant documentation for CDH: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html
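For example, a minimal sketch of what you might add to ~/.bashrc. Note that /etc/hadoop/conf is the usual client-configuration location on CDH, but check where yarn-site.xml actually lives on your nodes, since your installation may differ:

```shell
# In ~/.bashrc on each node.
# /etc/hadoop/conf is the typical CDH client config dir -- adjust if
# your yarn-site.xml lives somewhere else.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Sanity check: the directory should contain yarn-site.xml.
if [ -f "$HADOOP_CONF_DIR/yarn-site.xml" ]; then
  echo "HADOOP_CONF_DIR looks good: $HADOOP_CONF_DIR"
else
  echo "yarn-site.xml not found under $HADOOP_CONF_DIR" >&2
fi
```

After editing ~/.bashrc, run `source ~/.bashrc` (or open a new shell) so the variable takes effect.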

Basically, every node needs the configuration files in the directory the environment variable points to. As the linked documentation puts it: "Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines."
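Once HADOOP_CONF_DIR is set on the node you submit from, you can point Spark at YARN. A hedged sketch, assuming a Spark 1.3-era CDH install where spark-submit is on the PATH and `yarn-cluster` / `yarn-client` are the accepted master values (SparkPi and the examples jar path are the stock Spark example shipped with CDH; adjust the version in the jar name to match yours):

```shell
# Submit the bundled SparkPi example to YARN in cluster mode.
# spark-submit reads $HADOOP_CONF_DIR to locate the cluster.
spark-submit \
  --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/lib/spark-examples.jar 10

# For an interactive pyspark shell against YARN, use client mode:
pyspark --master yarn-client
```

If the submit fails with errors about connecting to `0.0.0.0:8032`, that usually means HADOOP_CONF_DIR is unset or pointing at a directory without a valid yarn-site.xml.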