I'm using VMs for my 5-node Hadoop cluster (one VM runs the NameNode, one runs the JobTracker/SecondaryNameNode/HMaster, and three run the DataNodes/TaskTrackers/HRegionServers/Zookeepers). It's the Cloudera distribution, which I installed manually rather than through Cloudera Manager.
Edit - The disk space on each VM containing a DataNode is roughly 50-60% full. It would be nice to get this done by tomorrow morning, but I could get away with 24 hours.
I have to return one of the VMs (specifically one particular DataNode) and replace it with another one (don't ask why). I have the second VM procured and can begin installing whenever I want.
Here is my current strategy:
- rsync the DataNode's data directory to the new node, as well as the Zookeeper data directory (see the sketch after this list).
- rsync all the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, hbase-site.xml, zoo.cfg).
- Ask this question on Stack Overflow
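Concretely, what I have in mind for steps one and two is something like the sketch below. It would be run on the old DataNode after stopping its daemons, and the directory paths are just assumptions based on my own dfs.data.dir / dataDir / config locations, not anything canonical:

```bash
#!/bin/bash
# Sketch of steps 1 and 2, run on the old DataNode with its
# DataNode/TaskTracker/RegionServer/Zookeeper daemons stopped first.
NEW=new-datanode.example.com   # hypothetical hostname of the replacement VM

# 1. DataNode block data and the Zookeeper data directory
rsync -avz /data/dfs/dn/        ${NEW}:/data/dfs/dn/
rsync -avz /var/lib/zookeeper/  ${NEW}:/var/lib/zookeeper/

# 2. Configuration files for Hadoop, HBase, and Zookeeper
for d in /etc/hadoop/conf /etc/hbase/conf /etc/zookeeper/conf; do
    rsync -avz ${d}/ ${NEW}:${d}/
done
```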
Why is number three on the list? The NameNode holds the metadata for the location of every block of every file stored on HDFS. The HBase meta table points to the RegionServers that hold the HFiles for its data. The Zookeeper server's data on the DataNode is essential, too.
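For context, this is how I'm inspecting what each service currently associates with the node being retired (the hostname is hypothetical, and `.META.` assumes the pre-0.96 HBase that ships with my CDH version):

```bash
OLD=old-datanode.example.com   # hypothetical hostname of the node being retired

# Blocks the NameNode currently maps to that DataNode
hadoop fsck / -files -blocks -locations | grep ${OLD}
hadoop dfsadmin -report

# Regions the HBase meta table assigns to the RegionServer on that host
echo "scan '.META.'" | hbase shell | grep ${OLD}

# State of the Zookeeper server on that host ("stat" four-letter word)
echo stat | nc ${OLD} 2181
```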
How do I go about instructing the NameNode and HBase/Zookeeper to point to the data on the newly procured VM? What else am I not considering?
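For what it's worth, these are the only places I can find the old hostname referenced in my setup, so I assume they would all need the new hostname (and the corresponding daemons restarted), unless I'm missing something. The paths again reflect my layout rather than anything standard:

```bash
OLD=old-datanode   # hypothetical short hostname of the node being replaced

# Hadoop (MRv1): conf/slaves controls which hosts run DataNodes/TaskTrackers
grep ${OLD} /etc/hadoop/conf/slaves

# HBase: the regionservers file, plus hbase.zookeeper.quorum in hbase-site.xml
grep ${OLD} /etc/hbase/conf/regionservers /etc/hbase/conf/hbase-site.xml

# Zookeeper: the server.N=host:2888:3888 entries in zoo.cfg on every node
grep ${OLD} /etc/zookeeper/conf/zoo.cfg
```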
Now, this is actually a dev environment, and I could export the HDFS data and the HBase data using Pig, wipe all of the DataNodes' and Zookeeper's data directories clean, and import the data back with Pig. Aside from that being lame, I believe doing the migration properly would be a good exercise for me.
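If I do fall back to that, I imagine the export half would look roughly like this per HBase table, using Pig's built-in HBaseStorage loader (the table name, column family, and output path are all made up for the sketch):

```bash
# Sketch of the export half for one hypothetical HBase table ('mytable', family cf1).
cat > export_mytable.pig <<'EOF'
-- Read every row of 'mytable' (row key plus all of column family cf1)
raw = LOAD 'hbase://mytable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:*', '-loadKey true')
      AS (rowkey:chararray, cf1:map[]);

-- Park it on HDFS; the re-import would be roughly the reverse of this
STORE raw INTO '/backup/mytable' USING PigStorage('\t');
EOF

pig export_mytable.pig
```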