I have an unexplained problem with uploading large files to s3a. I am using an EC2 instance with spark-2.4.4-bin-hadoop2.7 and writing a Spark DataFrame to s3a with V4 signing, authenticating to S3 with an access key and secret key.
The procedure is as follows: 1) read a CSV file from s3a into a Spark DataFrame; 2) process the data; 3) write the DataFrame back to s3a in Parquet format (a minimal sketch of the pipeline is shown below).
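For reference, this is roughly what the pipeline looks like; the bucket name, paths and the processing step below are placeholders, not the real ones:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# 1) read the CSV from s3a into a DataFrame (placeholder path)
df = spark.read.csv("s3a://my-bucket/input/data.csv", header=True, inferSchema=True)

# 2) processing step (stands in for the real transformations)
df = df.dropna()

# 3) write the result back to s3a as Parquet (placeholder path)
df.write.mode("overwrite").parquet("s3a://my-bucket/output/data.parquet")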
With a 400 MB CSV file the procedure works fine, no problem at all. But when I do the same with a 12 GB CSV file, the Parquet write to s3a fails with:
Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 2CA5F6E85BC36E8D, AWS Error Code: SignatureDoesNotMatch, AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.
I use the following settings:
import pyspark
from pyspark import SparkContext
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
sc = SparkContext()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = sc._jsc.hadoopConfiguration()
accesskey = input()
secretkey = input()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.endpoint", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("fs.s3a.fast.upload", "true")
hadoopConf.set("fs.s3a.fast.upload", "s3-eu-north-1.amazonaws.com")
hadoopConf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoopConf.set("fs.s3a.access.key", accesskey)
hadoopConf.set("fs.s3a.secret.key", secretkey)
I also tried adding these settings:
hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
hadoopConf.set("spark.speculation", "false")
hadoopConf.set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
hadoopConf.set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
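Since the spark.* entries above are really Spark properties rather than Hadoop ones, the equivalent way to supply them is at launch time, before the SparkContext is created. A minimal sketch of that variant, reusing the same flags and package, would be:

import os
# must be set before SparkContext() is created
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages org.apache.hadoop:hadoop-aws:2.7.3 "
    "--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true "
    "--conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true "
    "--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 "
    "--conf spark.speculation=false "
    "pyspark-shell"
)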
but they didn't help.
Again, the problem appears only with large files.
I would appreciate any help. Thank you.
I use
pyspark --driver-memory 16g --executor-memory 16g --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
to start pyspark, and when I try uploading a small file to S3 it works fine, but when I try a large file (around 10 GB) it throws a confusing 403 error. – JoshuaMHew