
I am using AWS EMR 5.0, Spark 2.0, Scala 2.11, and Parquet files on S3, encrypted with KMS (SSE with a custom key). I can read the encrypted Parquet files without any problem. However, when I write, I get a warning. Simplified, the code looks like:

val headerHistory = spark.read.parquet("s3://<my bucket>/header_1473640645")
headerHistory.write.parquet("s3://<my bucket>/temp/")

The write generates this warning:

16/09/15 13:11:11 WARN S3V4AuthErrorRetryStrategy: Attempting to re-send the request to my bucket.s3.amazonaws.com with AWS V4 authentication. To avoid this warning in the future, please use region-specific endpoint to access buckets located in regions that require V4 signing.

Do I need to pass an option, or set some environment variable?


1 Answer


Thank you for providing additional details.

Yes, this is a known issue with SSE + KMS when using EMRFS (the library used under the hood for S3 communication).

The problem is that when server-side encryption with KMS is enabled, the S3 client in EMRFS crafts its requests without specifying the signer type. To be conservative, the request is first attempted with V2 signing and then retried with V4 if the first attempt fails. This behavior slows down the overall process. EMRFS will be patched to use V4 on the first attempt; this should be fixed in the next EMR release.
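To make the cost of that fallback concrete, here is a toy model of the "try V2, then retry with V4" sequence. It is only an illustration, not EMRFS or AWS SDK code: `SignerFallbackSketch`, `bucketAccepts`, `send`, and `Outcome` are all hypothetical names, and the bucket is simply assumed to accept V4-signed requests only.

```scala
// Toy model of the signer fallback described above.
// NOT EMRFS or AWS SDK code; every name here is hypothetical.
object SignerFallbackSketch {
  sealed trait Signer
  case object V2 extends Signer
  case object V4 extends Signer

  // Which signer finally succeeded, and how many round trips it took.
  final case class Outcome(signer: Signer, attempts: Int)

  // Assumption: the bucket lives in a region that only accepts V4 signing.
  def bucketAccepts(signer: Signer): Boolean = signer == V4

  // Try each signer in the given order until the bucket accepts one.
  // Returns None if no signer in the list is accepted.
  def send(order: List[Signer]): Option[Outcome] =
    order.zipWithIndex.collectFirst {
      case (signer, i) if bucketAccepts(signer) => Outcome(signer, i + 1)
    }

  def main(args: Array[String]): Unit = {
    println(send(List(V2, V4))) // unpatched order: V2 first, then V4
    println(send(List(V4)))     // patched order: V4 on the first attempt
  }
}
```

In this model the unpatched order needs two attempts per request and the patched order needs one, which is why the job still succeeds but runs slower than it should.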

As mentioned, the warning doesn't break the job.

Please keep an eye out for the upcoming emr-5.x release (no ETA):

https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html