
I'm running Spark 2.4 on an EC2 instance. I am assuming an IAM role and setting the access key, secret key, and session token in sparkSession.sparkContext.hadoopConfiguration, along with the credentials provider "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider".
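For reference, this is roughly how I'm setting the configuration (accessKey, secretKey, and sessionToken are placeholders for the values returned by the AssumeRole call):

    val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
    // Temporary credentials obtained from the assumed IAM role (placeholders)
    hadoopConf.set("fs.s3a.access.key", accessKey)
    hadoopConf.set("fs.s3a.secret.key", secretKey)
    hadoopConf.set("fs.s3a.session.token", sessionToken)
    // Tell S3A to use the session credentials above
    hadoopConf.set("fs.s3a.aws.credentials.provider",
      "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")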

When I try to read a dataset from S3 (using s3a, which is also set in the Hadoop config), I get an error:

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 7376FE009AD36330, AWS Error Code: null, AWS Error Message: Forbidden

read command:

val myData = sparkSession.read.parquet("s3a://myBucket/myKey")

I've repeatedly checked the S3 path and it's correct. My assumed IAM role has the right privileges on the S3 bucket. The only thing I can figure at this point is that Spark has some sort of hidden credential-chain ordering, and even though I have set the credentials in the Hadoop config, it is still grabbing credentials from somewhere else (my instance profile?). But I have no way to diagnose that.

Any help is appreciated. Happy to provide any more details.

You might be able to use CloudTrail to view the denied request. It would then provide you with the Access Key that was used, so you can figure out which credentials it is using. That should help you track down where it is coming from. – John Rotenstein

1 Answer

  1. spark-submit will pick up the AWS_* environment variables and set them as the fs.s3a access, secret, and session keys, overwriting any you've already set.
  2. If you only want to use the IAM role's credentials, set fs.s3a.aws.credentials.provider to com.amazonaws.auth.InstanceProfileCredentialsProvider; it will be the only provider used (see the sketch after this list).
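A minimal sketch of option 2, assuming the configuration is set before the first read against the bucket:

    // Use only the EC2 instance profile credentials; no keys set explicitly
    sparkSession.sparkContext.hadoopConfiguration.set(
      "fs.s3a.aws.credentials.provider",
      "com.amazonaws.auth.InstanceProfileCredentialsProvider")

    val myData = sparkSession.read.parquet("s3a://myBucket/myKey")

Note that Hadoop caches filesystem instances per bucket, so this option needs to be in place before any S3A filesystem is first created for that bucket.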

Further Reading: Troubleshooting S3A