On a distributed Hadoop cluster, can I copy the same hdfs-site.xml file to the namenodes and datanodes?

Some of the setup instructions I've seen (e.g., Cloudera's) say to put the dfs.data.dir property in this file on the datanodes and the dfs.name.dir property in this file on the namenode, meaning I should keep two copies of hdfs-site.xml: one for the namenode and one for the datanodes.

But if it's all the same, I'd rather own and maintain a single copy of the file and push it to ALL nodes whenever I change it. Is there any harm or risk in having both dfs.name.dir and dfs.data.dir in the same file? What issues might arise if a datanode sees the dfs.name.dir property? And if there are issues, what other properties belong in the hdfs-site.xml on the namenode but not on the datanodes, and vice versa?
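
To make the question concrete, this is the kind of single combined file I have in mind; the paths below are just placeholders for wherever metadata and block storage actually live:

    <?xml version="1.0"?>
    <!-- hdfs-site.xml pushed identically to every node -->
    <configuration>
      <!-- relevant to the namenode: where FSImage and edit logs go -->
      <property>
        <name>dfs.name.dir</name>
        <value>/hadoop/dfs/name</value>
      </property>
      <!-- relevant to datanodes: where HDFS blocks go -->
      <property>
        <name>dfs.data.dir</name>
        <value>/hadoop/dfs/data</value>
      </property>
    </configuration>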

And finally, what properties need to be included in the hdfs-site.xml that I copy to a client machine (one that isn't a tasktracker or datanode, but just talks to the Hadoop cluster)?
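
My guess is that the client copy can be nearly empty, something like the sketch below, with the namenode address living in core-site.xml as usual; the replication value here is just an example, but I'd like confirmation:

    <?xml version="1.0"?>
    <!-- hdfs-site.xml on a client-only machine: my guess -->
    <configuration>
      <!-- replication is chosen client-side when a file is written -->
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>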

I've searched around, including the O'Reilly operations book, but can't find any good article describing how the config file needs to differ across node types. Thanks!

1 Answer

The namenode is picked from the masters file, so the FSImage and edit logs will be written only on the namenode and never on a datanode, even if you copy the same hdfs-site.xml to every node. Each daemon reads only the properties that apply to its own role, so a datanode simply ignores dfs.name.dir.
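
Roughly, with the stock start scripts, the role-defining files look like this (the hostnames here are made up):

conf/masters (one hostname per line):

    master.example.com

conf/slaves:

    datanode1.example.com
    datanode2.example.com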

For the second question: you can't necessarily communicate with HDFS without being on the cluster directly (i.e., without the Hadoop client libraries and configuration installed). If you want a remote client, you might try WebHDFS and build small web services on top of it to write to or read files in HDFS.
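
For example, once dfs.webhdfs.enabled is set to true in hdfs-site.xml, you can talk to the namenode over plain HTTP; something like this, where the hostname is a placeholder and 50070 is the default namenode HTTP port on older releases:

    # list a directory
    curl -i "http://namenode.example.com:50070/webhdfs/v1/user/me?op=LISTSTATUS"

    # read a file; -L follows the redirect to the datanode that serves the blocks
    curl -i -L "http://namenode.example.com:50070/webhdfs/v1/user/me/file.txt?op=OPEN"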