I'm new to Hadoop administration :)
I have an Apache Hadoop 2.4.1 cluster of 8 nodes with 16 TB of DFS used (I couldn't find the replication factor in any of the XML files), and Hive 0.13 with a MySQL metastore.
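(Side note: I think the configured default can be queried with `hdfs getconf`, and a file listing shows the per-file replication in its second column, though I'm not certain these catch every override:)

```
# Print the configured default replication factor
hdfs getconf -confKey dfs.replication

# The second column of a file listing shows each file's actual replication
hdfs dfs -ls /user/hive/warehouse
```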
Objective: back up the data on the cluster to an NFS drive, uninstall the cluster, install another distribution (Cloudera, Hortonworks), and reload the data from the NFS drive onto the new cluster.
There are two big Hive tables of 956 GB (roughly 9 billion rows) and 32 GB (a few million rows), plus a few other smaller tables.
Concerns/Queries:
- How do I back up the entire cluster to the NFS drive? Currently I have an independent machine (not part of the cluster) with the NFS drive mounted; the only approach I can picture is pulling everything through it, as in the first sketch below.
- The crudest way is to export the tables to CSV/TSV files on the NFS drive and load those into the new cluster when it's ready (second sketch below), but exporting tables this big to CSV/TSV makes me uncomfortable, and I can't think of another way.
- As I understand it, distcp works at the HDFS level, so I'm not sure whether I can use it for a faster copy from HDFS to NFS and from NFS to the new HDFS (third sketch below). Even with distcp, I would still need to back up the Hive metadata separately and then make it work with the new distro, which may not be possible; the best I can think of is dumping the metastore (last sketch below).
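For reference, the only whole-cluster copy I can picture is pulling everything through that independent machine. This assumes I install a Hadoop 2.4.1 client on it and point it at the cluster's NameNode (the hostname and paths below are placeholders):

```
# Run on the independent machine with /mnt/nfs/backup mounted;
# note this funnels all 16 TB through one machine's network interface
hadoop fs -copyToLocal hdfs://namenode:8020/user /mnt/nfs/backup/user
```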
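The CSV/TSV route would look roughly like this per table (the table and directory names are made up; I believe Hive 0.13 accepts ROW FORMAT on INSERT OVERWRITE DIRECTORY, otherwise it falls back to the default ^A delimiter):

```
# Stage the table as TSV files in HDFS...
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/export/big_table'
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
         SELECT * FROM big_table;"

# ...then pull the staged files down to the NFS mount
hdfs dfs -get /tmp/export/big_table /mnt/nfs/backup/big_table
```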
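The closest thing to distcp-to-NFS I can imagine is a file:// destination, but if I understand correctly that only works when the NFS drive is mounted at the same path on every worker node, since distcp runs as a MapReduce job (right now it's mounted on just the one machine):

```
# Each map task writes directly to the mount, so file:///mnt/nfs/backup
# must exist on every node that runs a task, not just where I launch this
hadoop distcp hdfs://namenode:8020/user/hive/warehouse file:///mnt/nfs/backup/warehouse
```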
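For the Hive metadata, I assume I can at least snapshot the MySQL metastore like this, even if the dump can't be imported as-is into the new distro's metastore schema (the database and user names are whatever hive-site.xml's javax.jdo.option.ConnectionURL points at; "metastore" and "hiveuser" are guesses):

```
# Dump schema + data of the metastore database to the NFS drive
mysqldump -u hiveuser -p metastore > /mnt/nfs/backup/hive_metastore.sql
```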
How should I proceed with this migration, or at least with the data transfer from HDFS to NFS and back?