1
votes

I'm new to Hadoop administration :)

I have an Apache Hadoop 2.4.1 cluster of 8 nodes with 16 TB of DFS used (I couldn't find the replication factor in any of the XML files), and Hive 0.13 with a MySQL metastore.

Objective: Back up the data on the cluster to an NFS drive, uninstall the cluster, install another distro (Cloudera, Hortonworks), and reload the data from the NFS drive into the new cluster.

There are two Hive tables of 956 GB (roughly 9 billion rows) and 32 GB (a few million rows), plus a few other smaller tables.

Concerns/Queries:

  1. How do I back up the entire cluster to the NFS drive? Currently I have an independent machine (not part of the cluster) with the NFS drive mounted.
  2. The crudest way is to export the tables to CSV/TSV files on the NFS drive and load them into the new cluster when it's ready. Exporting tables this big to CSV/TSV makes me uncomfortable, but I couldn't think of another way.
  3. As I understand it, distcp works at the HDFS level, so I'm not sure whether I can use it for a faster copy from HDFS to NFS and from NFS to the new HDFS. I would also need to back up the Hive metadata and make it work with the new distro, which may not be possible.

How should I proceed with this migration, or at least with the data transfer from HDFS to NFS and back?

2 Answers

0
votes

These are the steps we follow:

  1. Create the new Hadoop cluster.
  2. Copy the data to the new cluster using distcp.
  3. Drop the old cluster.
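
As a rough illustration of step 2, the sketch below only builds and prints a distcp command for one HDFS path instead of running it. The NameNode addresses and the map count are placeholders, not values from the question, and `/user/hive/warehouse` is just Hive's default warehouse location, which may differ on your cluster.

```shell
#!/usr/bin/env bash
# Hypothetical NameNode addresses; replace them with the real ones.
SRC_NN="${SRC_NN:-hdfs://old-nn.example.com:8020}"
DST_NN="${DST_NN:-hdfs://new-nn.example.com:8020}"

# Print the distcp command for one HDFS path (nothing is executed here).
#   $1 = absolute HDFS path, $2 = maximum number of map tasks (assumed default: 20)
build_distcp_cmd() {
  local path="$1"
  local maps="${2:-20}"
  # -update skips files that already match on the target,
  # -p preserves file attributes, -m caps the number of parallel copiers.
  echo "hadoop distcp -update -p -m ${maps} ${SRC_NN}${path} ${DST_NN}${path}"
}

# Example: the Hive warehouse directory (assumed default location).
build_distcp_cmd /user/hive/warehouse
```

Once the printed command looks right, it can be run as is on a node of either cluster. If the two clusters run incompatible Hadoop versions, the DistCp documentation suggests running the job on the destination cluster and reading the source over `webhdfs://` instead of `hdfs://`.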

If that is not an option:

  1. Write a shell script that copies the data using hadoop fs -get.
  2. Structure the script so that it takes an HDFS directory or file pattern as a parameter, which lets you run several instances in parallel with nohup.
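
A minimal sketch of such a script is shown here. The file name `copy_dir.sh`, the mount point `/mnt/nfs/hdfs_backup`, and the `HADOOP_CMD` override (handy for a dry run with `echo`) are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# copy_dir.sh (hypothetical name): copy one HDFS directory to the NFS mount.
# NFS_ROOT is an assumed mount point; set HADOOP_CMD=echo for a dry run.
NFS_ROOT="${NFS_ROOT:-/mnt/nfs/hdfs_backup}"
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

copy_hdfs_dir() {
  local hdfs_dir="$1"
  # Mirror the HDFS parent path under the NFS root.
  local target="${NFS_ROOT}$(dirname "$hdfs_dir")"
  mkdir -p "$target" || return 1
  # -get places the last component of hdfs_dir inside the target directory.
  $HADOOP_CMD fs -get "$hdfs_dir" "$target/"
}

# When the script is called with an argument, copy that directory.
if [ "$#" -ge 1 ]; then
  copy_hdfs_dir "$1"
fi
```

Several copies can then run side by side, one per directory, for example `nohup ./copy_dir.sh /user/hive/warehouse/big_table > big_table.log 2>&1 &` (the table path is made up). Keep in mind that every instance writes to the same NFS mount, so the mount's throughput is likely to be the limit rather than HDFS.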

0
votes

Use the hadoop fs -get command to transfer the files to the NAS, assuming the NAS is mounted on one of the Hadoop nodes. For the Hive metadata, run "SHOW CREATE TABLE tablename" for each table to get the CREATE statement, which can then be run on the new cluster.
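
To avoid running SHOW CREATE TABLE by hand for every table, something along these lines could collect the DDL of one database into a single file. This is only a sketch: the function name `dump_ddl`, the `HIVE_CMD` override, and the database and output file names in the usage note are assumptions.

```shell
#!/usr/bin/env bash
# HIVE_CMD can be overridden (for example with a stub) to test the loop.
HIVE_CMD="${HIVE_CMD:-hive}"

# Write the CREATE statement of every table in a database to one file.
#   $1 = database name, $2 = output file
dump_ddl() {
  local db="$1" out="$2"
  : > "$out"   # start with an empty output file
  $HIVE_CMD -S -e "USE ${db}; SHOW TABLES;" | while read -r table; do
    [ -z "$table" ] && continue
    # </dev/null keeps the inner call from consuming the table list on stdin.
    $HIVE_CMD -S -e "USE ${db}; SHOW CREATE TABLE ${table};" </dev/null >> "$out"
    echo ";" >> "$out"   # terminate each statement so the file can be replayed
  done
}
```

A call such as `dump_ddl default hive_ddl.sql` would produce a file that can be replayed with `hive -f hive_ddl.sql` on the new cluster. The generated statements normally include a LOCATION clause pointing at the old NameNode, so review the file before replaying it; partitioned tables will probably also need `MSCK REPAIR TABLE tablename` after the data is back in HDFS so that the partitions are registered again.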

Although the steps above serve your purpose, the recommended option is to copy the data directly from the existing cluster to the new one with distcp, and to recreate the tables with Hive DDL scripts.