1
votes

I have a HDP cluster on AWS and I have one s3(in other account) also, my hadoop version is Hadoop 3.1.1.3.0.1.0-187

Now I want to read from the s3 (which is in different account) and process, then write the result to my s3(same account as cluster).

But as per the HDP guide Here tells, I can configure only one keys of either my account or other account.

But in my case I want to configure two account keys, so How to do do that ?

Due to some security reason, other account can not change the bucket policy to add IAM role which is created in my account , Hence I tried to access like below

  1. Configured the keys of other account
  2. Added IAM role(which has access policy for my bucket) of my account

but Still I got below error when I tried to access my account s3 from spark write

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3

2

2 Answers

2
votes

What you need is to use the EC2 instance profile role. It is an IAM role that is attached to your instance: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html

You first create a role with permissions that allow s3 access. Then you attach that role to your HDP cluster(EC2 autoscaling group and EMR can both achieve that).No IAM access key configuration needed on your side, although AWS still does that for you in the background. This is the s3 "outbound" access part.

The 2nd step is to set up the bucket policy to allow cross-account access: https://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html You will need to do this for each bucket in your different accounts. This is basically the "inbound" s3 access permission part.

You will encounter 400 if any part of your access(i.e., your instance profile role's permission, S3 bucket ACL, bucket policy, public access block setting and etc..) is denied in the permission chain. There are much more layers on the "inbound" side. So to start to get things working, if you are not IAM expert, try to start with a very open policy(use '*' wildcard) and then narrow things down.

0
votes

If I've understood right

  1. you want your EC2 VMs to access an S3 bucket to which the IAM role doesn't have access
  2. your have a set of AWS login details for the external S3 bucket (login and password)

HDP3 has an default auth chain of, in order

  1. per-bucket secrets. fs.s3a.bucket.NAME.access.key, fs.s3a.bucket.NAME.secret.key
  2. config-wide secrets fs.s3a.access.key, fs.s3a.secret.key
  3. env vars AWS_ACCESS_KEY and AWS_SECRET_KEY
  4. the IAM Role (it does an HTTP GET to the 169.something server which serves up a new set of IAM role credentials at least once an hour)

What you need to try here is set up some per-bucket secrets for only the external source (either in a JCEKS file on all nodes in core-site.xml, or in the spark default. For example, if the external bucket was s3a://external, you'd have

spark.hadoop.fs.s3a.bucket.external.access.key AKAISOMETHING spark.hadoop.fs.s3a.bucket.external.secret.key SECRETSOMETHING

HDP3/Hadoop 3 can handle >1 secret in the same JCEKS file without problems. HADOOP-14507. my code. Older versions let you put username:secret in the URI, but that's such a security troublespot (everything logs those URIs as they aren't viewed as sensistive), that feature has been cut from Hadoop now. Stick to the JCEKs file with a per-bucket secret, falling back to IAM role for your own data

Note you can fiddle with the authentication list for ordering and behaviour: if you add use the TemporaryAWSCredentialsProvider then it'll support session keys as well, which is often handy. <property> <name>fs.s3a.aws.credentials.provider</name> <value> org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider, com.amazonaws.auth.EnvironmentVariableCredentialsProvider, org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider </value> </property>