
I'm running a Spark Streaming app from my local machine to read from an S3 bucket.

I'm using the hadoop-aws JAR to set the S3 authentication parameters - https://hadoop.apache.org/docs/r3.0.0/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3

This is the 'Forbidden' error message:

org.apache.hadoop.fs.s3a.S3AFileSystem printAmazonServiceException - Caught an AmazonServiceException, which means your request made it to Amazon S3, but was rejected with an error response for some reason.
org.apache.hadoop.fs.s3a.S3AFileSystem printAmazonServiceException - Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: #####, AWS Error Code: null, AWS Error Message: Forbidden
org.apache.hadoop.fs.s3a.S3AFileSystem printAmazonServiceException - HTTP Status Code: 403
org.apache.hadoop.fs.s3a.S3AFileSystem printAmazonServiceException - AWS Error Code: null
org.apache.hadoop.fs.s3a.S3AFileSystem printAmazonServiceException - Error Type: Client
org.apache.hadoop.fs.s3a.S3AFileSystem printAmazonServiceException - Request ID: #####
org.apache.hadoop.fs.s3a.S3AFileSystem printAmazonServiceException - Class Name: com.amazonaws.services.s3.model.AmazonS3Exception

Code to read from the bucket:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// createSparkContext is a local helper that builds the SparkContext
val sc: SparkContext = createSparkContext(scName)
val hadoopConf = sc.hadoopConfiguration
// Use the S3A filesystem implementation for s3a:// URIs
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

val ssc = new StreamingContext(sc, Seconds(time))
val lines = ssc.textFileStream("s3a://foldername/subfolder/")
lines.print()

I have exported the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables in my terminal, but it still gives me 'Forbidden'.

I am able to access S3 from the terminal (using the AWS profile), so I'm not sure why it doesn't work when I go through Spark. Any ideas appreciated.

You need to export those variables to all executors, not just your local machine. – OneCricketeer

Are you located in any embargoed countries? – Mobin Ranjbar

@cricket_007 how do I do that? If I set those variables in hadoopConf, isn't that enough? – covfefe

@MobinRanjbar no I'm not. – covfefe

"fs.s3a.impl" -> "org.apache.hadoop.fs.s3a.S3AFileSystem" only sets the filesystem implementation. You also need to set the keys; see the code here: stackoverflow.com/q/49230086/2308683. Also see cloudera.com/documentation/enterprise/latest/topics/… and search for "Specify the credentials at run time". – OneCricketeer

1 Answer


To keep the keys out of the code in plain text, you can add a core-site.xml file to the classpath with the keys:

<property>
    <name>fs.s3a.access.key</name>
    <value>...</value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value>...</value>
</property>
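
Hadoop's Configuration loads core-site.xml from the classpath automatically, so for a local run it should be enough to put the file in src/main/resources (or wherever your build places resources); that path is an assumption about a typical sbt/Maven layout.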

Or, if you don't mind putting the keys directly in the code:

sc.hadoopConfiguration.set("fs.s3a.access.key", "...")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "...")

The recommended way, though, is to use a Java JCEKS credential file, so the keys never sit in code or plain-text config at all.
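
A minimal sketch of that approach; the keystore path here is illustrative, and the commands come from Hadoop's credential-provider tooling:

// Create the keystore once from a shell:
//   hadoop credential create fs.s3a.access.key -value <access-key> -provider jceks://file/home/user/s3a.jceks
//   hadoop credential create fs.s3a.secret.key -value <secret-key> -provider jceks://file/home/user/s3a.jceks

// Then point Hadoop at the keystore instead of setting the keys themselves:
sc.hadoopConfiguration.set("hadoop.security.credential.provider.path",
  "jceks://file/home/user/s3a.jceks")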