
I have a scenario in which I have to pull data from a Hadoop cluster into AWS. I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands on the cluster. I need to be able to pull the files from the Hadoop cluster into AWS. The data is available in Hive.

I thought of the following options:
1) Sqoop the data from Hive? Is that possible?
2) S3DistCp (running it on AWS)? If so, what configuration would be needed?

Any suggestions?

"Without commands"? Umm. No. Sqoop export? Did you read that documentation? AWS just needs network access to your cluster . Or Try this github.com/HotelsDotCom/circus-train - OneCricketeer
You could look at nifi.apache.org to help do this perhaps. - Binary Nerd

1 Answer


If the Hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, some Hive query which uses hdfs:// as input and writes out to S3. You'll need to deal with Kerberos auth, though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, though you can go the other way.
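
As a rough sketch of both routes, assuming the EC2 host has a Hadoop client installed, network access to the cluster's NameNode and DataNodes, and a valid Kerberos ticket (the hostnames, realm, bucket and paths below are placeholders, not values from the question):

    # obtain a Kerberos ticket for the cluster first
    kinit -kt /path/to/user.keytab user@EXAMPLE.REALM

    # copy a directory straight out of HDFS into S3 (s3a needs AWS credentials,
    # e.g. fs.s3a.access.key/fs.s3a.secret.key or an EC2 instance profile)
    hadoop distcp \
      hdfs://namenode.example.com:8020/warehouse/mydb.db/mytable \
      s3a://my-bucket/hive/mytable/

    # or, for a specific slice of the data, a Hive query that writes to S3
    beeline -u "jdbc:hive2://hiveserver.example.com:10000/default;principal=hive/_HOST@EXAMPLE.REALM" \
      -e "INSERT OVERWRITE DIRECTORY 's3a://my-bucket/exports/mytable/' SELECT * FROM mydb.mytable"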

You can also run distcp locally on one or more machines, though you are limited by the bandwidth of those individual systems. distcp is best when it schedules the uploads on the hosts which actually have the data.
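
If you do go the single-machine route, distcp's own options let you cap the load it puts on that box; a minimal sketch (the mapper count and bandwidth figure are arbitrary examples):

    # limit the copy to 4 map tasks at roughly 50 MB/s each
    hadoop distcp -m 4 -bandwidth 50 \
      hdfs://namenode.example.com:8020/warehouse/mydb.db/mytable \
      s3a://my-bucket/hive/mytable/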

Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files; this is what incremental backup tools tend to use.
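
To illustrate that idea only (not a production tool, and it assumes you can read the NameNode's audit log at the path shown): each audit line records the operation and the source path, so filtering for write operations gives you a list of paths to re-copy.

    # extract paths touched by create/rename/delete operations
    grep -E 'cmd=(create|rename|delete)' /var/log/hadoop-hdfs/hdfs-audit.log \
      | sed -n 's/.*src=\([^ ]*\).*/\1/p' \
      | sort -u > changed-paths.txt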