8
votes

In order to access my S3 bucket I have exported my credentials:

export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESS_KEY_ID=

I can verify that everything works by doing

aws s3 ls mybucket

I can also verify with boto3 that it works in Python:

import boto3

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
        .put(Body=open("text.py", "rb"), ContentType="text/x-py")

This works and I can see the file in the bucket.

However, when I do this with Spark:

from pyspark import SparkContext
from pyspark.sql import SQLContext

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*")

I get a nice

> Caused by: org.jets3t.service.S3ServiceException: Service Error
> Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> Message: <?xml version="1.0"
> encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The
> AWS Access Key Id you provided does not exist in our
> records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId><RequestId>XXXXX</RequestId><HostId>xxxxxxx</HostId></Error>

This is how I submit the job locally:

spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py

Why does it work with the command line + boto3, but Spark is choking?

EDIT:

Same issue using s3a:// with

hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "xxxx")
hadoopConf.set("fs.s3a.secret.key", "xxxxxxx")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

and the same issue using aws-sdk 1.7.4 and Hadoop 2.7.2.

I think that it would also work by exporting the AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY. Is creating that credentials file really a necessity? As you can see, Spark correctly picks up the AWS_ACCESS_KEY from the env variable but for some reason fails to authenticate? – Johny19
Spark is distributed. Just because you have the env variables set in one executor doesn't mean the other executors have them as well. You should use SparkConf to set the values (see the sketch after these comments). – OneCricketeer
Getting the same error with hadoopConf = spark_context._jsc.hadoopConfiguration(); hadoopConf.set("fs.s3.awsAccessKeyId", "xxxxx"); hadoopConf.set("fs.s3.awsSecretAccessKey", "xxxxxx"), and the same error when setting them with SparkConf. – Johny19
@cricket_007 I think I found the issue; I have created a new post for it: stackoverflow.com/questions/42669246/… – Johny19
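
For reference, here is a minimal sketch of the SparkConf route suggested in the comments, assuming the credentials are still exported under the standard names in the shell that runs spark-submit; the spark.hadoop.* prefix should forward these keys into the Hadoop configuration on the executors as well as the driver:

import os
from pyspark import SparkConf, SparkContext

# "s3a-test" is just a placeholder app name; fs.s3a.access.key / fs.s3a.secret.key
# are the standard hadoop-aws configuration keys for the s3a connector.
conf = SparkConf() \
    .setAppName("s3a-test") \
    .set("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"]) \
    .set("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

spark_context = SparkContext(conf=conf)
lines = spark_context.textFile("s3a://mybucket/my/path/*")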

1 Answer

7
votes

Spark will automatically copy your AWS credentials to the s3n and s3a secrets. Apache Spark releases don't touch s3:// URLs, because in Apache Hadoop the s3:// scheme is associated with the original, now-deprecated s3 client, which is incompatible with everything else.

On Amazon EMR, s3:// is bound to Amazon's own EMR S3 client, and the EC2 VMs provide the secrets to the executors automatically, so I don't think it bothers with the env-var propagation mechanism. It may also be that, given how it sets up the authentication chain, you can't override the EC2/IAM data.

If you are trying to talk to S3 and you are not running in an EMR VM, then presumably you are using Apache Spark with the Apache Hadoop JARs, not the EMR versions. In that world, use s3a:// URLs to get the latest S3 client library.
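
As a rough sketch (reusing the bucket path and the --packages line from the question; whether those JAR versions actually match your Hadoop install is exactly the kind of classpath issue mentioned below), the only change on the Spark side is the URL scheme:

# test.py -- same read as before, but through the s3a connector
from pyspark import SparkContext

spark_context = SparkContext()
rdd = spark_context.textFile("s3a://mybucket/my/path/*")
print(rdd.count())

submitted the same way as before:

spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py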

If that doesn't work, look at the troubleshooting section of the Apache docs. There's a section on "403" there, including recommended troubleshooting steps. It can be due to classpath/JVM version problems as well as credential issues, even clock skew between the client and AWS.