
I need to access Parquet files on S3 from Spark 1.6. My attempted approach is at the bottom; it fails with a "403 Forbidden" error, the same error I would get if the keys were invalid or missing.

Historically, I've done that using the deprecated s3n with keys in-line:

s"s3n://${key}:${secretKey}@${s3_bucket}"

For all the well-documented reasons, s3n and this format are problematic.
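
For context, the full (deprecated) usage was roughly this; a sketch only, assuming the same bucket/folder/dt=<day> layout used in the examples below:

// Deprecated: credentials embedded in the s3n URL (sketch, not recommended)
val path = s"s3n://${key}:${secretKey}@${s3_bucket}/folder/dt=$day"
val parq = sqlContext.read.parquet(path)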

When I access JSON from Spark 1.6, this works...

sc.hadoopConfiguration.set("fs.s3a.endpoint",  "s3.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.access.key", key)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

val fg = "s3a://bucket/folder/dt=" + day
val rdd = sc.textFile(fg)

When I access Parquet from Spark 2.x, this works...

val spark = {
  SparkSession.builder
  .config("fs.s3a.access.key", key)
  .config("fs.s3a.secret.key", secretKey)
  .config("mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
}

val fg = "s3a://bucket/folder/dt=" + day
val parquet = spark.read.parquet(fg)

My best guess for Parquet from Spark 1.6 would be something like this, but I get a 403 Forbidden error...

sqlContext.setConf("fs.s3a.access.key", key)
sqlContext.setConf("fs.s3a.secret.key", secretKey)
sqlContext.setConf("fs.s3a.endpoint", "s3.amazonaws.com")

val fg = "s3a://bucket/folder/dt=" + day
val parq = sqlContext.read.parquet(fg)    // 403 Forbidden here

Any advice appreciated.

EDIT - adding some detail about other s3a settings.

// sqlContext.getAllConfs.filter(_._1 contains "s3a").foreach(println)
(fs.s3a.connection.maximum,15)
(fs.s3a.impl,org.apache.hadoop.fs.s3a.S3AFileSystem)
(fs.s3a.fast.buffer.size,1048576)
(fs.s3a.awsSecretAccessKey,DikembeMutombo)
(fs.s3a.connection.timeout,50000)
(fs.s3a.buffer.dir,${hadoop.tmp.dir}/s3a)
(fs.s3a.endpoint,s3.amazonaws.com)
(fs.s3a.paging.maximum,5000)
(fs.s3a.threads.core,15)
(fs.s3a.multipart.purge,false)
(fs.s3a.threads.max,256)
(fs.s3a.multipart.threshold,2147483647)
(fs.s3a.awsAccessKeyId,DikembeMutombo)
(fs.s3a.connection.ssl.enabled,true)
(fs.s3a.connection.establish.timeout,5000)
(fs.s3a.threads.keepalivetime,60)
(fs.s3a.max.total.tasks,1000)
(fs.s3a.fast.upload,false)
(fs.s3a.attempts.maximum,10)
(fs.s3a.multipart.size,104857600)
(fs.s3a.multipart.purge.age,86400)

EDIT 2 - adding the workaround I'm using in case I can't solve this.

I think I have two options that I can make work, but I don't care for either. Here's the full set of combinations I've tried:

  • s3a with credentials either hard-coded in core-site.xml OR set dynamically using sqlContext.setConf()... does not work.
  • s3a with credentials inline in the URL... works.
  • s3 with credentials set via sqlContext.setConf()... does not work.
  • s3 with credentials hard-coded in core-site.xml... works.

Unfortunately... including credentials in-line is bad security practice (and creates problems)... but hard-coding them in core-site.xml isn't a complete solution either, because I need to be able to toggle between 4 sets of credentials. So my hacky solution is to hard-code the s3 credentials in core-site.xml... and use inline credentials with s3a to access Parquet in the other AWS environments.

I would rather use s3a only, and never with inline credentials, but unfortunately I can't get that to work.
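
For reference, the inline s3a fallback looks roughly like this (a sketch using the same key, secretKey, and bucket/path layout as above; it embeds secrets in the path, which is exactly what I want to avoid):

// Workaround sketch: s3a URL with embedded credentials (works here, but leaks secrets into logs and shell history)
val fg = s"s3a://${key}:${secretKey}@bucket/folder/dt=" + day
val parq = sqlContext.read.parquet(fg)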


1 Answer


Debugging auth problems is hard as everyone goes out of their way to not log the secrets.

I'd start with the Troubleshooting S3A documentation.

Here I think the issue is that the property names for the access and secret keys changed in s3a: the connector looks for fs.s3a.access.key and fs.s3a.secret.key, not the fs.s3a.awsAccessKeyId / fs.s3a.awsSecretAccessKey entries that appear in your config dump.

Set them and hopefully your problems will go away.
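
In Spark 1.6 a minimal sketch of that would be to set those property names on the Hadoop configuration of the SparkContext backing your SQLContext (assuming the same key and secretKey variables as in the question):

// Sketch: use the current s3a property names, set on the Hadoop configuration (Spark 1.6)
sc.hadoopConfiguration.set("fs.s3a.access.key", key)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")

val parq = sqlContext.read.parquet("s3a://bucket/folder/dt=" + day)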

I agree with your goal of not putting secrets inline; in Hadoop 2.8+ the s3a connector warns against secrets in URLs, because it's hard to keep them out of logs.