I need to access Parquet files on S3 from Spark 1.6. My attempted approach is at the bottom; it fails with a "403 Forbidden" error, the same error I'd get if the keys were invalid or missing.
Historically, I've done that using the deprecated s3n with keys in-line:
s"s3n://${key}:${secretKey}@${s3_bucket}"
For all the well-documented reasons, both s3n and this inline-credentials format are problematic.
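For context, the old pattern looked roughly like this (sketch only; the bucket and path are placeholders):
// Spark 1.6, deprecated s3n scheme with credentials embedded in the URL
val s3nPath = s"s3n://${key}:${secretKey}@${s3_bucket}/folder/dt=$day"
val oldRdd = sc.textFile(s3nPath)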
When I access JSON from Spark 1.6, this works...
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.access.key", key)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
val fg = "s3a://bucket/folder/dt=" + day
val rdd = sc.textFile(fg)
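(For completeness: in 1.6 that RDD of JSON strings can also be turned into a DataFrame via sqlContext.read.json if needed; sketch only.)
// Sketch: DataFrame from the JSON RDD (Spark 1.6 DataFrameReader.json accepts an RDD[String])
val jsonDf = sqlContext.read.json(rdd)
jsonDf.printSchema()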
When I access Parquet from Spark 2.x, this works...
val spark = {
SparkSession.builder
.config("fs.s3a.access.key", key)
.config("fs.s3a.secret.key", secretKey)
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.getOrCreate()
}
val fg = "s3a://bucket/folder/dt=" + day
val parquet = spark.read.parquet(fg)
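(Side note, in case it matters: I believe the same settings can also be passed with the spark.hadoop. prefix, which Spark copies into the Hadoop configuration. Sketch only:)
// Sketch: equivalent Spark 2.x setup via spark.hadoop.-prefixed properties
import org.apache.spark.sql.SparkSession
val sparkAlt = SparkSession.builder
  .config("spark.hadoop.fs.s3a.access.key", key)
  .config("spark.hadoop.fs.s3a.secret.key", secretKey)
  .getOrCreate()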
My best guess for Parquet from Spark 1.6 would be something like this, but I get a 403 Forbidden error...
sqlContext.setConf("fs.s3a.access.key", key)
sqlContext.setConf("fs.s3a.secret.key", secretKey)
sqlContext.setConf("fs.s3a.endpoint", "s3.amazonaws.com")
val fg = "s3a://bucket/folder/dt=" + day
val parq = sqlContext.read.parquet(fg) // 403 Forbidden here
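For reference, the direct Parquet analogue of the working Spark 1.6 JSON snippet above would be roughly this (sketch only; same sc and sqlContext):
// Sketch: mirror the JSON approach (credentials on sc.hadoopConfiguration), then read Parquet
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.access.key", key)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
val parqAlt = sqlContext.read.parquet("s3a://bucket/folder/dt=" + day)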
Any advice appreciated.
EDIT - adding some detail about other s3a settings.
// sqlContext.getAllConfs.filter(_._1 contains "s3a").foreach(println)
(fs.s3a.connection.maximum,15)
(fs.s3a.impl,org.apache.hadoop.fs.s3a.S3AFileSystem)
(fs.s3a.fast.buffer.size,1048576)
(fs.s3a.awsSecretAccessKey,DikembeMutombo)
(fs.s3a.connection.timeout,50000)
(fs.s3a.buffer.dir,${hadoop.tmp.dir}/s3a)
(fs.s3a.endpoint,s3.amazonaws.com)
(fs.s3a.paging.maximum,5000)
(fs.s3a.threads.core,15)
(fs.s3a.multipart.purge,false)
(fs.s3a.threads.max,256)
(fs.s3a.multipart.threshold,2147483647)
(fs.s3a.awsAccessKeyId,DikembeMutombo)
(fs.s3a.connection.ssl.enabled,true)
(fs.s3a.connection.establish.timeout,5000)
(fs.s3a.threads.keepalivetime,60)
(fs.s3a.max.total.tasks,1000)
(fs.s3a.fast.upload,false)
(fs.s3a.attempts.maximum,10)
(fs.s3a.multipart.size,104857600)
(fs.s3a.multipart.purge.age,86400)
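(The dump above is from sqlContext.getAllConfs. To compare it against the Hadoop-level configuration, something like this should print the fs.s3a.* entries from sc.hadoopConfiguration; sketch only:)
// Sketch: dump s3a settings from the Hadoop configuration for comparison
import scala.collection.JavaConverters._
sc.hadoopConfiguration.asScala
  .filter(_.getKey.contains("s3a"))
  .foreach(e => println(s"(${e.getKey},${e.getValue})"))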
EDIT 2 - adding the workaround I'll use if I can't solve this.
Of the four variations below, I can make two work, but I don't care for either.
- s3a with credentials either hard-coded in core-site.xml OR set dynamically using sqlContext.setConf()... does not work.
- s3a with credentials inline in the URL... works.
- s3 with credentials set via sqlContext.setConf()... does not work.
- s3 with credentials hard-coded in core-site.xml... works.
Unfortunately, including credentials in-line is bad security practice (and creates its own problems), and hard-coding them in core-site.xml isn't a complete solution either: I need to be able to toggle between four sets of credentials. So my hack-y solution is to hard-code the s3 credentials in core-site.xml, and use s3a with inline credentials to access Parquet in the other AWS environments.
I'd rather use s3a only, and never with inline credentials, but unfortunately I can't get that to work.
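In case it helps to see the hack concretely, the toggle ends up looking roughly like this (sketch only; pathFor is a hypothetical helper, the bucket/paths are placeholders, and I'm assuming the secret needs URL-encoding for the inline form):
// Sketch of the workaround: plain s3 relies on credentials hard-coded in core-site.xml,
// while other environments use s3a with inline (URL-encoded) credentials in the URL.
import java.net.URLEncoder
def pathFor(day: String, inlineS3a: Boolean, key: String = "", secretKey: String = ""): String =
  if (inlineS3a) {
    // assumption: encode the secret in case it contains reserved characters like '/' or '+'
    val enc = URLEncoder.encode(secretKey, "UTF-8")
    s"s3a://${key}:${enc}@bucket/folder/dt=$day"
  } else {
    s"s3://bucket/folder/dt=$day" // credentials come from core-site.xml
  }
val parqHome  = sqlContext.read.parquet(pathFor(day, inlineS3a = false))
val parqOther = sqlContext.read.parquet(pathFor(day, inlineS3a = true, key, secretKey))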