What's the right way to provide Hadoop/Spark IAM role based access for S3?

Question

We have Hadoop cluster running on EC2 and EC2 instances attached to a role which has access to S3 bucket for example: "stackoverflow-example".

Several users are placing Spark jobs in the cluster, we used keys in the past but do not want to continue and want to migrate to role, so any jobs placed on the Hadoop cluster will use role associated with ec2 instances. Did a lot of search and found 10+ tickets, some of them are still open, some of them are fixed and some of them do not have any comments.

Want to know whether it's still possible to use IAM role for jobs(Spark, Hive, HDFS, Oozie, etc) placing on Hadoop cluster. Most of the tutorials are discussing passing key (fs.s3a.access.key, fs.s3a.secret.key) which is not good enough and not secured as well. We also faced issues with credential provider with Ambari.

Some references:

https://issues.apache.org/jira/browse/HADOOP-13277

https://issues.apache.org/jira/browse/HADOOP-9384

https://issues.apache.org/jira/browse/SPARK-16363

Are you using EMR to run the clusters or managing yourself, just the IAM roles work perfectly with EMR service? — Blakey

stevel stevel · Accepted Answer · 2017-02-15T10:58:10

That first one you link to HADOOP-13277 says "can we have IAM?" to which the JIRA was closed "you have this in s3a". The second, HADOOP-9384, was "add IAM to S3n", closed as "switch to s3a". And SPARK-16363? incomplete bugrep.

If you use S3a, and do not set any secrets, then the s3a client will fall back to looking at the special EC2 instance metadata HTTP server, and try to get the secrets from there.

That it: it should just work.

What's the right way to provide Hadoop/Spark IAM role based access for S3?

1 Answers