
The requirement is to copy HDFS files from a Hadoop cluster (non-AWS) to an AWS S3 bucket with a standalone Java application scheduled via a daily cron job. We would be using the AmazonS3.copyObject() method for copying. How do I specify the Kerberized server connection details for the source Hadoop cluster so that the S3 client can access the files from the source HDFS folder?

The command below was used earlier, but it is not a secure way of transferring files.

hadoop distcp -Dfs.s3a.access.key=<<>> -Dfs.s3a.secret.key=<<>> hdfs://nameservice1/test/test1/folder s3a://<>/test/test1/folder


1 Answer


S3 doesn't go near Kerberos; your cron job will have to run kinit against a keytab to authenticate for the HDFS access.
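
For example (a minimal sketch; the keytab path and principal name are placeholders, and the S3 credentials would still come from one of the mechanisms below):

# Obtain a Kerberos ticket non-interactively from the keytab before touching HDFS
kinit -kt /etc/security/keytabs/etl.keytab etl@EXAMPLE.COM

# Commands run after this point in the script act as that principal
hadoop distcp hdfs://nameservice1/test/test1/folder s3a://<bucket>/test/test1/folder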

The most secure way to pass secrets to distcp is to keep them in a JCEKS file in the cluster filesystem, such as one in the home directory of the user running the job, with read permission restricted to that user (maximum paranoia: set a password for encrypting the file and pass that in with the job). See "Protecting S3 Credentials with Credential Providers".
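
A rough sketch of that approach, assuming the user is called etl and the keystore lives at /user/etl/s3.jceks (both made-up names):

# Store the S3 keys once in a JCEKS keystore on HDFS (the command prompts for each value)
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/etl/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/etl/s3.jceks

# Restrict the file to its owner
hdfs dfs -chmod 600 /user/etl/s3.jceks

# Point distcp at the credential provider instead of putting keys on the command line
hadoop distcp \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/etl/s3.jceks \
  hdfs://nameservice1/test/test1/folder s3a://<bucket>/test/test1/folder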

One more trick to try: create session credentials using the CLI assume-role command, and pass the temporary credentials to distcp for s3a to pick up. That way, yes, the secrets are visible to ps, but they aren't the longer-lived secrets. You can also ask for a specific role there with restricted access compared to the user's full account (e.g. read/write access to one bucket only).
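
Something along these lines, where the role ARN, session name and bucket are placeholders and the temporary keys are parsed out with the AWS CLI's --query option:

# Ask STS for short-lived credentials for a restricted role
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/s3-upload-only \
  --role-session-name nightly-distcp \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text)
read -r ACCESS_KEY SECRET_KEY SESSION_TOKEN <<< "$CREDS"

# Hand the temporary triple to s3a; TemporaryAWSCredentialsProvider expects a session token
hadoop distcp \
  -Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  -Dfs.s3a.access.key="$ACCESS_KEY" \
  -Dfs.s3a.secret.key="$SECRET_KEY" \
  -Dfs.s3a.session.token="$SESSION_TOKEN" \
  hdfs://nameservice1/test/test1/folder s3a://<bucket>/test/test1/folder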