0 votes

We are able to access us-east-1 from our current code, but we cannot access a parquet file in us-east-2. Note that with the us-east-2 connection, creating the dataframe works fine in IntelliJ, but it gives a 400 error when we try from the spark-shell.

I was trying to make it work in the spark-shell:

/Users/test/Downloads/spark-2.3.3-bin-hadoop2.7/bin/spark-shell --jars /Users/test/Downloads/hadoop-aws-2.7.3.jar,/Users/test/Downloads/aws-java-sdk-1.7.4.jar

val configuration = sc.hadoopConfiguration

configuration.set("fs.s3a.impl.disable.cache", "true")
configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
configuration.set("fs.defaultFS", "s3a://parquet-dev")
configuration.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
configuration.set("fs.s3a.access.key", "xyz")
configuration.set("fs.s3a.secret.key", "abc")

val fileName = "s3a://xyz:abc@parquet-dev/c000.snappy.parquet"

val df = spark.sqlContext.read.parquet(fileName)

com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: asadfas, AWS Error Code: null, AWS Error Message: Bad Request


2 Answers

1 vote
  1. fs.s3a.endpoint is the correct option; I've just verified it's in Hadoop 2.7
  2. you shouldn't put the secrets in the filename URL, as they get logged everywhere.
  3. and you shouldn't need to set the fs.defaultFS or fs.s3a.impl values (see the sketch after this list).
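
Putting those points together, a minimal sketch of that setup (the bucket and object names are copied from the question; reading the keys from environment variables is just one option, and assumes they are exported in the shell):

// minimal S3A setup: region endpoint plus credentials, nothing else
val conf = sc.hadoopConfiguration
conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
conf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
conf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

// plain bucket/key path: the credentials stay out of the URL and out of the logs
val df = spark.read.parquet("s3a://parquet-dev/c000.snappy.parquet")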

"Bad Request" is a fairly vague error from amazon, it means some kind of auth problem, without any details. It may be that you need to switch to V4 signing, which can only be done with the hadoop-2.7.x/AWS-1.7.x JARs via JVM properties. Other stack overflow posts cover that topic.

If you are trying to work with S3 through the S3A connector, you would be best off starting by upgrading to the Hadoop 2.9 JARs and the shaded AWS SDK, or 2.8.x as an absolute minimum. There have been dramatic changes in the hadoop-aws code, and the more current AWS SDK makes a big difference too.
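
As a rough illustration only (the exact versions here are assumptions, and the hadoop-aws JAR must match the Hadoop version already on Spark's classpath), the upgraded launch would look something like:

spark-shell --jars hadoop-aws-2.9.2.jar,aws-java-sdk-bundle-1.11.199.jar

The aws-java-sdk-bundle JAR is the shaded SDK; pick the bundle version that the chosen hadoop-aws release was built against.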

0 votes

It is a simple change, but hard to find in the AWS docs or anywhere else.

Below are the changes we made (the same settings can be applied in whichever language you use).

spark-shell \
    --master local[4] \
    --driver-memory 2g \
    --conf 'spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true' \
    --conf 'spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true' \
    --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.7.jar


// check on the driver that the -D flag above took effect; it should return "true"
System.getProperty("com.amazonaws.services.s3.enableV4")

sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "access")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "secret")
// region-specific endpoint for the us-east-2 bucket
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")

val fileName = "s3a://parquet123/c000.parquet"
val df = spark.sqlContext.read.parquet(fileName)
df.count

Buckets in some newer S3 regions only support Signature Version 4, which the old SDKs won't use unless you specify it:

System.getProperty("com.amazonaws.services.s3.enableV4")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")

and

--conf 'spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true' \
--conf 'spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true'

The com.amazonaws.services.s3.enableV4 system property is very important and must be set for all executor JVMs by specifying the flags above. Thanks.
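
One quick way to confirm the flag actually reached the executors (just an illustrative check, not part of the original setup) is to read the property back from inside a task:

// each task reports the system property as seen by its executor JVM;
// every printed value should be "true" if the --conf flags above were applied
sc.parallelize(1 to sc.defaultParallelism)
  .map(_ => System.getProperty("com.amazonaws.services.s3.enableV4"))
  .collect()
  .foreach(println)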