0 votes

We are able to access us-east-1 from our current code, but we cannot access a parquet file in us-east-2. Note that with the us-east-2 connection, creating the dataframe works fine in IntelliJ, but it gives a 400 error when we try from the spark-shell.

I was trying to make it work in the spark-shell:

/Users/test/Downloads/spark-2.3.3-bin-hadoop2.7/bin/spark-shell --jars /Users/test/Downloads/hadoop-aws-2.7.3.jar,/Users/test/Downloads/aws-java-sdk-1.7.4.jar

val configuration = sc.hadoopConfiguration

configuration.set("fs.s3a.impl.disable.cache", "true")
configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
configuration.set("fs.defaultFS", "s3a://parquet-dev")
configuration.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
configuration.set("fs.s3a.access.key", "xyz")
configuration.set("fs.s3a.secret.key", "abc")

val fileName = "s3a://xyz:abc@parquet-dev/c000.snappy.parquet"

val df = spark.sqlContext.read.parquet(fileName)

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: asadfas, AWS Error Code: null, AWS Error Message: Bad Request


2 Answers

1 vote
  1. fs.s3a.endpoint is the correct option; I've just verified it's in Hadoop 2.7
  2. you shouldn't put the secrets in the filename URL, as they get logged everywhere.
  3. and you shouldn't need to set the fs.defaultFS or fs.s3a.impl values (see the sketch after this list).
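
Putting those points together, a minimal sketch of that setup (the bucket and object names are copied from the question; reading the keys from environment variables is just one option, and assumes they are exported in the shell):

// minimal S3A setup: region endpoint plus credentials, nothing else
val conf = sc.hadoopConfiguration
conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
conf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
conf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// plain bucket/key path: the credentials stay out of the URL and out of the logs
val df = spark.read.parquet("s3a://parquet-dev/c000.snappy.parquet")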

"Bad Request" is a fairly vague error from amazon, it means some kind of auth problem, without any details. It may be that you need to switch to V4 signing, which can only be done with the hadoop-2.7.x/AWS-1.7.x JARs via JVM properties. Other stack overflow posts cover that topic.

If you are trying to work with S3 through the S3A connector, you would be best off starting by upgrading to the Hadoop 2.9 JARs and the shaded AWS SDK, or 2.8.x as an absolute minimum. There have been dramatic changes in the hadoop-aws code, and the more current AWS SDK makes a big difference too.
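
As a rough illustration only (the exact versions here are assumptions, and the hadoop-aws JAR must match the Hadoop version already on Spark's classpath), the upgraded launch would look something like:

spark-shell --jars hadoop-aws-2.9.2.jar,aws-java-sdk-bundle-1.11.199.jar

The aws-java-sdk-bundle JAR is the shaded SDK; pick the bundle version that the chosen hadoop-aws release was built against.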

0 votes

It is a simple change, but hard to find in the AWS docs or anywhere else.

Below are the changes we made (the same settings can be applied in whichever language you use).

spark-shell \
    --master local[4] \
    --driver-memory 2g \
    --conf 'spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true' \
    --conf 'spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true' \
    --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.7.jar


// check on the driver that the -D flag above took effect; it should return "true"
System.getProperty("com.amazonaws.services.s3.enableV4")

sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "access")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret")
// region-specific endpoint for the us-east-2 bucket
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")

val fileName = "s3a://parquet123/c000.parquet"
val df = spark.sqlContext.read.parquet(fileName)
df.count

Buckets in some newer S3 regions only support Signature Version 4, which the old SDKs won't use unless you specify it:

System.getProperty("com.amazonaws.services.s3.enableV4")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")

and

--conf 'spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true' \
--conf 'spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true'

The com.amazonaws.services.s3.enableV4 system property is very important and must be set for all executor JVMs by specifying the flags above. Thanks.
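
One quick way to confirm the flag actually reached the executors (just an illustrative check, not part of the original setup) is to read the property back from inside a task:

// each task reports the system property as seen by its executor JVM;
// every printed value should be "true" if the --conf flags above were applied
sc.parallelize(1 to sc.defaultParallelism)
  .map(_ => System.getProperty("com.amazonaws.services.s3.enableV4"))
  .collect()
  .foreach(println)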