
I have set up a Spark EMR cluster on AWS (Hadoop 2.8.5, Spark 2.4.4). I have an S3 bucket URL and its access credentials. After setting up the cluster and attaching a notebook, I am able to read data from the bucket using spark.read.parquet("s3n://...") after setting the Hadoop configuration with:

sc._jsc.hadoopConfiguration().set('fs.s3n.awsAccessKeyId', '...')
sc._jsc.hadoopConfiguration().set('fs.s3n.awsSecretAccessKey', '...')
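
For reference, the read itself looks like this (the bucket and prefix are hypothetical placeholders, not the real values):

# Read works once the keys above have been set on the Hadoop configuration
df = spark.read.parquet("s3n://<bucket>/<prefix>/")
df.printSchema()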

However, I have read in numerous docs that this is not recommended, since it can end up exposing the keys in the logs.
So what I am trying instead is to create a Hadoop credential file in HDFS and then add an EMR configuration under the 'core-site' classification that points to the credential file. These are the steps I followed:
1. Created the EMR Cluster
2. Connected to the master node over SSH (PuTTY) and created the Hadoop credential file:

$ hadoop credential create fs.s3a.access.key -provider jceks://hdfs/<path_to_hdfs_file> -value <aws_access_id>
$ hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/<path_to_hdfs_file> -value <aws_secret_key>

3. From the management console, I added a configuration under the 'core-site' classification, provided the path "jceks://hdfs/<path_to_hdfs_file>" for spark.hadoop.security.credential.provider.path, and applied the configuration to the master and slave nodes.
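
For clarity, the notebook-side equivalent of what I am trying to achieve would look roughly like the sketch below. The HDFS path, bucket, and prefix are placeholders, and I believe the plain Hadoop property name (without the spark.hadoop. prefix) is the one to use when setting the Hadoop configuration directly from the session, though I may be wrong about that:

# Point the S3A connector at the Hadoop credential provider created in step 2
sc._jsc.hadoopConfiguration().set(
    'hadoop.security.credential.provider.path',
    'jceks://hdfs/<path_to_hdfs_file>'
)
# Read via s3a, the connector the JCEKS entries were created for
# (fs.s3a.access.key / fs.s3a.secret.key)
df = spark.read.parquet("s3a://<bucket>/<prefix>/")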

The Issue:
Even so, I am not able to access the bucket from the EMR notebook using spark.read.parquet(); it throws an Access Denied exception. Am I doing something wrong, or is there an intermediate step I am missing? I do not want to hard-code the keys in my EMR notebook. Any help will be highly appreciated; I have been stuck on this issue for a week.
P.S. The bucket and the cluster are in different regions. However, I have also tried the same process with the cluster created in the same region as the bucket, and the issue still persists.


1 Answer

  • Access to S3 data in EMR should be through Amazon's own connector and s3:// URLs; any other scheme references code they don't support (see the sketch after this answer).
  • You get the permissions of the IAM role the VM/container was deployed with. If you want to access a specific bucket, choose the right role.

It's moot, but the s3n connector (obsolete and unsupported) doesn't support JCEKS files anyway.
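
As a rough sketch of what the first point looks like in practice (the bucket and prefix are placeholders, and this assumes the cluster's instance profile role already has read access to the bucket):

# No keys anywhere: EMRFS picks up the credentials of the cluster's IAM role,
# so reading with the s3:// scheme works if the role allows it.
df = spark.read.parquet("s3://<bucket>/<prefix>/")
df.show(5)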