1 vote

I have an unexplained problem with uploading large files to s3a. I am using an EC2 instance with spark-2.4.4-bin-hadoop2.7 and writing a Spark DataFrame to s3a with V4 signing. I authenticate to S3 with an access key and secret key.

The procedure is as follows: 1) read a CSV file from s3a into a Spark DataFrame; 2) process the data; 3) write the DataFrame back to s3a in Parquet format.
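
Stripped down, the pipeline looks roughly like this (bucket names, paths and the transformation are placeholders; the real processing step is more involved):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# 1) read the CSV from s3a (bucket and key are placeholders)
df = spark.read.csv("s3a://my-bucket/input/data.csv", header=True, inferSchema=True)

# 2) process the data (placeholder transformation)
df = df.withColumn("processed_at", F.current_timestamp())

# 3) write the DataFrame back to s3a as Parquet
df.write.mode("overwrite").parquet("s3a://my-bucket/output/data_parquet")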

If I run this procedure on a 400 MB CSV file there is no problem; everything works fine. But when I do the same with a 12 GB CSV file, an error appears while the Parquet files are being written to s3a:

Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2CA5F6E85BC36E8D, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.

I use the following settings:

import pyspark
from pyspark import SparkContext
import os

# pull in the S3A filesystem connector
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"

sc = SparkContext()

# enable V4 request signing in the AWS SDK (JVM system property)
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

hadoopConf = sc._jsc.hadoopConfiguration()

accesskey = input()
secretkey = input()

hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("fs.s3a.fast.upload", "true")
hadoopConf.set("fs.s3a.fast.upload", "s3-eu-north-1.amazonaws.com")  # overwrites the "true" value set on the previous line
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.access.key", accesskey)
hadoopConf.set("fs.s3a.secret.key", secretkey)

I also tried adding these settings:

hadoopConf.set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')
hadoopConf.set('spark.speculation', 'false')
hadoopConf.set('spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4', 'true')
hadoopConf.set('spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4', 'true')

but it didn’t help.
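
As far as I understand, those spark.* options are normally passed to Spark itself (via spark-submit / SparkConf) rather than to the Hadoop configuration, so I believe the equivalent would be roughly the following (untested sketch, same values as above):

import os

# rough sketch: pass the spark.* options on spark-submit instead of the Hadoop configuration
# (must be set before the SparkContext is created)
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages=org.apache.hadoop:hadoop-aws:2.7.3 "
    "--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 "
    "--conf spark.speculation=false "
    "--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true "
    "--conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true "
    "pyspark-shell"
)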

Again, the problem appears only with the large file.

I would appreciate any help. Thank you.

Comment from JoshuaMHew: I am having the same issue with Spark 2.2.0 and Hadoop 2.7.2. I start pyspark with pyspark --driver-memory 16g --executor-memory 16g --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2, and uploading a small file to S3 works fine, but a large file (around 10 GB) throws a confusing 403 error.

3 Answers

1 vote

Try setting fs.s3a.fast.upload to true.

Otherwise: the multipart upload code was only ever experimental in Hadoop 2.7, so you may have hit a corner case. Upgrade to the hadoop-2.8 releases or later and it should go away.
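
For example, on top of the setup in the question (a sketch; the buffer option is only read by the 2.8+ JARs):

from pyspark import SparkContext

sc = SparkContext()
hadoopConf = sc._jsc.hadoopConfiguration()

# enable the incremental block upload, and make sure nothing overwrites it afterwards
hadoopConf.set("fs.s3a.fast.upload", "true")
# Hadoop 2.8+ only: buffer blocks on disk (the default), in arrays or in byte buffers
hadoopConf.set("fs.s3a.fast.upload.buffer", "disk")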

0 votes

I updated Hadoop from 2.7.3 to 2.8.5 and now everything works without errors.
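
With the PYSPARK_SUBMIT_ARGS from the question, that amounts to roughly this change (sketch; the matching AWS SDK is pulled in transitively by --packages):

import os

# hadoop-aws 2.8.5 instead of 2.7.3
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.8.5 pyspark-shell"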

0 votes

I had the same issue. I created a Spark cluster on EMR (5.27.0) configured with Spark 2.4.4 on Hadoop 2.8.5, uploaded the notebook containing my code to a notebook I made in EMR JupyterLab, ran it, and it worked perfectly!