Hive, HDFS data to local system and back

Question

I'm new to Hadoop administration :)

I have an Apache Hadoop 2.4.1 cluster of 8 nodes with 16 TB of DFS used (I couldn't find the replication factor in any of the XML files) and Hive 0.13 with a MySQL metastore.

Objective: Back up the data on the cluster to an NFS drive, uninstall the cluster, install another distro (Cloudera, Hortonworks), and reload the data from the NFS drive into the new cluster.

There are two Hive tables of 956 GB (roughly 9 billion rows) and 32 GB (a few million rows), plus a few other smaller tables.

Concerns/Queries:

How do I back up the entire cluster to the NFS drive? Currently I have an independent machine (not part of the cluster) with the NFS drive mounted.
The crudest way is to export the tables to CSV/TSV files on the NFS drive and load them into the new cluster when it's ready. Exporting tables this big to CSV/TSV makes me uncomfortable, but I couldn't think of another way.
As I understand it, distcp works at the HDFS level, so I'm not sure whether I can use it for a faster copy from HDFS to NFS and from NFS to the new HDFS. I would then also need to back up the Hive metadata and make it work with the new distro, which may not be possible.

How should I proceed with this migration, or at least with the data transfer from HDFS to NFS and back?

Durga Viswanath Gadiraju · Accepted Answer · 2015-12-01T12:16:32

These are the steps we follow:

Create the new Hadoop cluster
Copy the data to the new cluster using distcp
Drop the old cluster
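
The distcp step above could look roughly like the sketch below. The NameNode addresses (`old-nn`, `new-nn`), the port, the warehouse path, and the helper names are assumptions for illustration, not values from the question. By default the script only prints the command; set `DRY_RUN=0` to actually run it.

```shell
#!/usr/bin/env bash
# Hypothetical NameNode addresses; replace with the real ones.
SRC_NN="${SRC_NN:-hdfs://old-nn:8020}"
DST_NN="${DST_NN:-hdfs://new-nn:8020}"

# Print the distcp command for one HDFS path (same path on both clusters).
# -update copies only files missing or changed on the target,
# -m caps the number of parallel map tasks.
distcp_cmd() {
  local path="$1" maps="${2:-20}"
  echo "hadoop distcp -update -m ${maps} ${SRC_NN}${path} ${DST_NN}${path}"
}

# Run the copy only when DRY_RUN=0; otherwise just show the command.
copy_path() {
  local cmd
  cmd="$(distcp_cmd "$@")"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    $cmd
  else
    echo "$cmd"
  fi
}

# Assumed Hive warehouse location; check hive.metastore.warehouse.dir.
copy_path /user/hive/warehouse 20
```

When the two clusters run different Hadoop versions, the source is usually addressed over `webhdfs://` instead of `hdfs://`; whether that is needed here depends on the target distro.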

If that is not an option:

Write a shell script that copies the data out of HDFS using hadoop fs -get
Write the script so that several copies of it can run in parallel under nohup, each taking an HDFS directory or file pattern as a parameter
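
A minimal sketch of such a script follows. The NFS mount point (`/mnt/nfs/hdfs_backup`), the script name, and the helper names are assumptions; the machine running it needs both a Hadoop client and the NFS mount. By default it only prints the commands; set `DRY_RUN=0` to copy for real.

```shell
#!/usr/bin/env bash
# Assumed NFS mount point; override with NFS_ROOT=...
NFS_ROOT="${NFS_ROOT:-/mnt/nfs/hdfs_backup}"

# Map an HDFS path to its destination directory under the NFS root,
# so the HDFS directory layout is mirrored on the NFS drive.
local_target() {
  echo "${NFS_ROOT}$(dirname "$1")"
}

# Copy one HDFS directory or file pattern to the NFS drive.
backup_path() {
  local src="$1" dest
  dest="$(local_target "$src")"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    mkdir -p "$dest" && hadoop fs -get "$src" "$dest/"
  else
    echo "hadoop fs -get $src $dest/"
  fi
}

# Each argument is an HDFS directory or a quoted file pattern.
for p in "$@"; do
  backup_path "$p"
done
```

Saved as, say, hdfs_to_nfs.sh, one instance per table or partition can be started in the background, for example `DRY_RUN=0 nohup ./hdfs_to_nfs.sh /user/hive/warehouse/big_table > big_table.log 2>&1 &`. File patterns should be quoted so that Hadoop, not the local shell, expands them. Loading into the new cluster is the same idea in reverse with hadoop fs -put.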

