I'm writing code to create a temporary Hadoop cluster. Unlike most Hadoop clusters, I need the location for logs, HDFS files, etc., to be in a specific temporary network location that is different each time the cluster is started. This network directory will be generated at runtime; I do not know the directory name at the time I'm checking in the shell scripts like `hadoop-env.sh` and the XML files like `core-default.xml`.
- At checkin time: I can modify the shell scripts like `hadoop-env.sh` and the XML files like `core-default.xml`.
- At run time: I generate the temporary directory that I want to use for my data storage (see the sketch below).
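For concreteness, the runtime step is something like this sketch, where `/mnt/scratch` stands in for whatever network mount my environment actually provides:

```sh
# Create a fresh, uniquely named directory on the network mount for this
# cluster instance. /mnt/scratch is a placeholder path, not a real default.
CLUSTER_TMP_DIR="$(mktemp -d /mnt/scratch/hadoop-cluster.XXXXXX)"
echo "This cluster's logs and HDFS data will live under ${CLUSTER_TMP_DIR}"
```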
I can instruct most of Hadoop to use this temporary directory by specifying environment variables like `HADOOP_LOG_DIR` and `HADOOP_PID_DIR`, and if necessary I can modify the shell scripts to read those environment variables.
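For example, building on the hypothetical `CLUSTER_TMP_DIR` from the sketch above:

```sh
# Export the locations before starting any daemons; hadoop-env.sh can be
# modified to pick these up if it does not honor them already.
export HADOOP_LOG_DIR="${CLUSTER_TMP_DIR}/logs"
export HADOOP_PID_DIR="${CLUSTER_TMP_DIR}/pids"
```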
However, HDFS determines the local directory in which it stores the filesystem via two properties that are defined in XML files, not environment variables or shell scripts: `hadoop.tmp.dir` in `core-default.xml` and `dfs.datanode.data.dir` in `hdfs-default.xml`.
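For reference, the stock entries look roughly like this (values and descriptions are approximate and vary by Hadoop version; note that they expand `${...}` references such as `${user.name}` and `${hadoop.tmp.dir}`, which is exactly the kind of indirection I'd like to exploit):

```xml
<!-- core-default.xml (approximate) -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

<!-- hdfs-default.xml (approximate) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file://${hadoop.tmp.dir}/dfs/data</value>
  <description>Where a DFS data node stores its blocks.</description>
</property>
```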
Is there any way to edit these XML files to determine the value of `hadoop.tmp.dir` at runtime? Or, alternatively, is there any way to use environment variables to override the XML-configured value of `hadoop.tmp.dir`?