
I have a scenario in which I have to pull data from a Hadoop cluster into AWS. I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands on the cluster. I need to be able to pull the files from the Hadoop cluster into AWS. The data is available in Hive.

I thought of the following options:
1) Sqoop the data from Hive? Is that possible?
2) S3DistCp (running it on AWS)? If so, what configuration would be needed?

Any suggestions?

"Without commands"? Umm. No. Sqoop export? Did you read that documentation? AWS just needs network access to your cluster . Or Try this github.com/HotelsDotCom/circus-train - OneCricketeer
You could look at nifi.apache.org to help do this perhaps. - Binary Nerd

1 Answer


If the Hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, some Hive query which uses hdfs:// as input and writes out to S3. You'll need to deal with Kerberos auth, though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, though you can go the other way.
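
As a rough sketch of both routes, assuming the EC2 host has a Hadoop client installed, network access to the cluster's NameNode and DataNodes, and a valid Kerberos ticket (the hostnames, realm, bucket and paths below are placeholders, not values from the question):

    # obtain a Kerberos ticket for the cluster first
    kinit -kt /path/to/user.keytab user@EXAMPLE.REALM

    # copy a directory straight out of HDFS into S3 (s3a needs AWS credentials,
    # e.g. fs.s3a.access.key/fs.s3a.secret.key or an EC2 instance profile)
    hadoop distcp \
      hdfs://namenode.example.com:8020/warehouse/mydb.db/mytable \
      s3a://my-bucket/hive/mytable/

    # or, for a specific slice of the data, a Hive query that writes to S3
    beeline -u "jdbc:hive2://hiveserver.example.com:10000/default;principal=hive/_HOST@EXAMPLE.REALM" \
      -e "INSERT OVERWRITE DIRECTORY 's3a://my-bucket/exports/mytable/' SELECT * FROM mydb.mytable"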

You can also run distcp locally on one or more machines, though you are limited by the bandwidth of those individual systems. distcp is best when it schedules the uploads on the hosts which actually have the data.
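
If you do go the single-machine route, distcp's own options let you cap the load it puts on that box; a minimal sketch (the mapper count and bandwidth figure are arbitrary examples):

    # limit the copy to 4 map tasks at roughly 50 MB/s each
    hadoop distcp -m 4 -bandwidth 50 \
      hdfs://namenode.example.com:8020/warehouse/mydb.db/mytable \
      s3a://my-bucket/hive/mytable/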

Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files; this is what incremental backup tools tend to use.
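
To illustrate that idea only (not a production tool, and it assumes you can read the NameNode's audit log at the path shown): each audit line records the operation and the source path, so filtering for write operations gives you a list of paths to re-copy.

    # extract paths touched by create/rename/delete operations
    grep -E 'cmd=(create|rename|delete)' /var/log/hadoop-hdfs/hdfs-audit.log \
      | sed -n 's/.*src=\([^ ]*\).*/\1/p' \
      | sort -u > changed-paths.txt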