
I have set up my local PySpark, but every time I try to read files from S3 with the s3a protocol, it returns a 403 AccessDenied error. The account I am trying to connect to only supports AWS AssumeRole, which gives me a temporary access key, secret key, and session token.

I am using Spark 2.4.4 with Hadoop 2.7.3 and the aws-java-sdk-1.7.4 jar files. I know the issue is not with my security token, as I can use the same credentials in boto3 to query the same bucket. I am setting up my Spark session as follows:

conf = spark.sparkContext._conf.setAll([
    ('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem'),
    ('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider'),
    ('fs.s3a.endpoint', 's3-ap-southeast-2.amazonaws.com'),
    ('fs.s3a.access.key', '...'),
    ('fs.s3a.secret.key', '...'),
    ('fs.s3a.session.token', '...')])

spark_01 = spark.builder.config(conf=conf).appName('s3connection').getOrCreate()

df = spark_01.read.load('s3a://<some bucket>')

The error I get is this:

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: ... , AWS Error Code

Update: the full error stack:

19/10/08 16:37:17 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/usr/local/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o47.load.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: DFF18E66D647F534, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: ye5NgB5wRhmHpn37tghQ0EuO9K6vPDE/1+Y6m3Y5sGqxD9iFOktFUjdqzn6hd/aHoakEXmafA9o=
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:557)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:355)
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Might the problem be related to the signature version? – Lamanus
@Lamanus Could you please elaborate on what you mean by signature version? Thanks. – codingEnthusiast
S3 requests can be signed with one of two signature versions, V2 or V4, for the communication between the client and S3. So you have to check whether your bucket requires V4. If it is only valid for V4, then you have to set your client to use the V4 signature. – Lamanus
@Lamanus I understood that to mean setting the following, but I still get the same error: conf = (SparkConf().set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true").set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")); scT = SparkContext(conf=conf); scT.setSystemProperty("com.amazonaws.services.s3.enableV4", "true") – codingEnthusiast
I have also followed your answer in this post: stackoverflow.com/questions/57477385/… and ran with the following params: spark-submit main.py -Dcom.amazonaws.services.s3.enableV4=true -Dcom.amazonaws.services.s3.enableV4=true, but I still get the same error. – codingEnthusiast
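For anyone chasing the V4-signature angle further, here is a minimal sketch (not from the thread; the app name is made up) of passing the enableV4 flag as a JVM option so it is in place on both the driver and the executors before the AWS SDK builds its S3 client:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# The system property must exist before the AWS SDK creates the S3 client,
# so pass it as a JVM option rather than calling setSystemProperty after
# the JVMs are already running. Note that in client mode the driver option
# has to go on the spark-submit command line instead, since the driver JVM
# is already up by the time this code runs.
conf = (SparkConf()
        .set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
        .set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true'))

spark = SparkSession.builder.config(conf=conf).appName('s3-v4-check').getOrCreate()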

1 Answer


To solve this issue, you need to do the following two things. (I found that you already do the second one in your code, so only the first is required.)

  1. Use hadoop-aws-2.8.5.jar on its own, rather than aws-java-sdk-1.7.4.jar together with hadoop-aws-2.7.7.jar. (See the Getting Started section in https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html.)
  2. Set fs.s3a.aws.credentials.provider as follows, as you already do in your code: ('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider'). This enables the use of a session token. With this setting, you can provide the keys in your code as you show, or through the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. (A combined sketch of both steps follows.)
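Putting the two steps together, here is a minimal sketch (the bucket, path, and key values are placeholders; spark.jars.packages is just one convenient way to pull in hadoop-aws-2.8.5, downloading the jar manually also works):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('s3connection')
         # Step 1: a single consistent hadoop-aws build instead of mixed jar versions.
         .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.8.5')
         # Step 2: the credentials provider that understands session tokens.
         # The spark.hadoop. prefix tells Spark to copy these entries into
         # the Hadoop configuration that S3A reads.
         .config('spark.hadoop.fs.s3a.aws.credentials.provider',
                 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
         .config('spark.hadoop.fs.s3a.access.key', '<temporary-access-key>')
         .config('spark.hadoop.fs.s3a.secret.key', '<temporary-secret-key>')
         .config('spark.hadoop.fs.s3a.session.token', '<session-token>')
         .getOrCreate())

df = spark.read.load('s3a://<some-bucket>/<some-prefix>')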

For reference, the setting ('fs.s3a.aws.credentials.provider', 'com.amazonaws.auth.DefaultAWSCredentialsProviderChain') is also useful for loading the credentials from ~/.aws/credentials without putting them in your source code. (See http://wrschneider.github.io/2019/02/02/spark-credentials-file.html.)
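A short sketch of that variant; with no keys in the code, the default chain checks the AWS_* environment variables, then the ~/.aws/credentials file, then instance metadata, in that order:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('s3connection')
         .config('spark.hadoop.fs.s3a.aws.credentials.provider',
                 'com.amazonaws.auth.DefaultAWSCredentialsProviderChain')
         .getOrCreate())

# To use a non-default profile from ~/.aws/credentials, export it before
# starting Spark, e.g.:  export AWS_PROFILE=my-profile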