I'm writing code to create a temporary Hadoop cluster. Unlike most Hadoop clusters, I need the location for logs, HDFS files, etc. to be a specific temporary network location that is different each time the cluster is started. This network directory is generated at runtime; I do not know the directory name at the time I check in the shell scripts like hadoop-env.sh and the XML files like core-default.xml.

  • At checkin time: I can modify the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
  • At run time: I generate the temporary directory that I want to use for my data storage.

I can instruct most of Hadoop to use this temporary directory by specifying environment variables like HADOOP_LOG_DIR and HADOOP_PID_DIR, and if necessary I can modify the shell scripts to read those environment variables.

However, HDFS determines the local directory in which it stores the filesystem from two properties that are defined in XML files, not in environment variables or shell scripts: hadoop.tmp.dir in core-default.xml and dfs.datanode.data.dir in hdfs-default.xml.

Is there any way to edit these XML files to determine the value of hadoop.tmp.dir at runtime? Or, alternatively, is there any way to use environment variables to override the XML-configured value of hadoop.tmp.dir?

Very interesting question! Out of curiosity, and maybe for future reference, why do you need such functionality? - vefthym
@vefthym: It's for automated testing of how my software integrates with Hadoop. The machine on which the test will be running doesn't normally have Hadoop running on it, and must not have Hadoop running on it afterwards; but I have free rein for the duration of the test to run whatever processes I want. - AlexC

1 Answer

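We had a similar requirement earlier. Configuring dfs.data.dir and dfs.name.dir as part of HADOOP_OPTS worked well for us (in Hadoop 2 and later these are the deprecated names for dfs.datanode.data.dir and dfs.namenode.name.dir). For example: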

export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA"

The same method can be used to set other configuration properties as well, such as the NameNode URL.
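
To tie this back to the question, here is a minimal sketch of how the per-run directory could be generated and passed along. The function name setup_temp_cluster_env, the directory layout under the base directory, and the hadoop-test prefix are all assumptions made for illustration; only HADOOP_LOG_DIR, HADOOP_PID_DIR, HADOOP_OPTS and the property names come from the question and the answer above. Passing hadoop.tmp.dir the same way is also an assumption, so verify it against your Hadoop version.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): creates a fresh base directory
# and exports the variables that the Hadoop scripts read at startup.
setup_temp_cluster_env() {
    # $1 is the parent directory (for example the network share); defaults to /tmp.
    base_dir=$(mktemp -d "${1:-/tmp}/hadoop-test.XXXXXX") || return 1

    # Subdirectory names are an assumption; any layout works.
    NAMENODE_DATA="$base_dir/name"
    DFS_DATA="$base_dir/data"
    mkdir -p "$NAMENODE_DATA" "$DFS_DATA" \
             "$base_dir/logs" "$base_dir/pids" "$base_dir/tmp" || return 1

    # Environment variables mentioned in the question.
    export HADOOP_LOG_DIR="$base_dir/logs"
    export HADOOP_PID_DIR="$base_dir/pids"

    # Properties passed as -D options, as in the answer. Paths containing
    # spaces would break this simple string and need extra care.
    export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA -Dhadoop.tmp.dir=$base_dir/tmp"
    export NAMENODE_DATA DFS_DATA
}
```

Call setup_temp_cluster_env with the parent directory before launching the Hadoop daemons from the same shell, and remove "$base_dir" once the test has stopped them.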