4 votes

Thanks to Stack Overflow, I managed to copy hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar from the Maven repo into $SPARK_HOME/jars/ to get s3a:// working for reading from S3 buckets using pyspark (Spark 2.2.0) on my EC2 Linux instance.

df=spark.read.option("header","true").csv("s3a://bucket/csv_file")

But I'm stuck at writing the transformed data back into the S3 bucket with server-side encryption enabled. As expected, the action below throws "Access Denied", since I haven't specified any flag to enable server-side encryption within the pyspark execution environment:

df.write.parquet("s3a://s3_bucket/output.parquet")

To verify, I wrote to a local file and uploaded it to the S3 bucket using --sse, and this works fine:

aws s3 cp local_path s3://s3_bucket/ --sse

How do I enable server-side encryption in pyspark, similar to the above?

Note: I did try adding "fs.s3a.enableServerSideEncryption true" to spark-defaults.conf and passing the same via the --conf parameter of pyspark at startup, but no joy.
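Roughly, the attempt looked like this (reconstructed from memory, so the exact key may have differed):

pyspark --conf "fs.s3a.enableServerSideEncryption=true"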

Thanks


2 Answers

4 votes

This is how I understood it after going through the following Hadoop JIRAs: HADOOP-10675, HADOOP-10400, and HADOOP-10568.

Since the S3 filesystem support lives in Hadoop, the following needs to be added to spark-defaults.conf if all S3 bucket puts in your estate are protected by SSE:

spark.hadoop.fs.s3a.server-side-encryption-algorithm AES256

After adding this, I was able to write successfully to an S3 bucket protected by SSE (server-side encryption).
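The same property can also be set when building the SparkSession rather than in spark-defaults.conf. A minimal PySpark sketch (the bucket paths are placeholders):

from pyspark.sql import SparkSession

# Keys prefixed with "spark.hadoop." are passed through to the Hadoop
# configuration, so this is equivalent to the spark-defaults.conf entry above.
spark = (SparkSession.builder
         .appName("sse-write")
         .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
         .getOrCreate())

df = spark.read.option("header", "true").csv("s3a://bucket/csv_file")
df.write.parquet("s3a://s3_bucket/output.parquet")  # objects are written with SSE-S3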

2
votes

Hopefully you have already set up the configuration with the access key, secret key, enableServerSideEncryption, and the algorithm to be used for the encryption:

val hadoopConf = sc.hadoopConfiguration
// Route s3:// URIs through the native S3 filesystem
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3.awsSecretAccessKey", "xxx")
// Ask S3 to encrypt each written object at rest with AES-256 (SSE-S3)
hadoopConf.set("fs.s3.enableServerSideEncryption", "true")
hadoopConf.set("fs.s3.serverSideEncryptionAlgorithm", "AES256")

On EMR, the following EMRFS option enforces server-side encryption:

--emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256]
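This flag is supplied when the cluster is created; a sketch of the call, with all cluster parameters as placeholders:

aws emr create-cluster --name "sse-cluster" \
  --release-label emr-5.8.0 \
  --applications Name=Spark \
  --instance-type m4.large --instance-count 3 \
  --emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256]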

Command:

./bin/spark-submit --verbose --jars lib/app.jar \
  --master spark://master-amazonaws.com:7077 \
  --class com.elsevier.spark.SparkSync \
  --conf "spark.executor.extraJavaOptions=-Ds3service.server-side-encryption=AES256"

http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html

Protecting Data Using Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3)

Server-side encryption is about protecting data at rest. Server-side encryption with Amazon S3-managed encryption keys (SSE-S3) employs strong multi-factor encryption. Amazon S3 encrypts each object with a unique key. As an additional safeguard, it encrypts the key itself with a master key that it regularly rotates. Amazon S3 server-side encryption uses one of the strongest block ciphers available, 256-bit Advanced Encryption Standard (AES-256), to encrypt your data.

Amazon S3 supports bucket policies that you can use if you require server-side encryption for all objects that are stored in your bucket. For example, the following bucket policy denies upload object (s3:PutObject) permission to everyone if the request does not include the x-amz-server-side-encryption header requesting server-side encryption.

{
  "Version": "2012-10-17",
  "Id": "PutObjPolicy",
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    },
    {
      "Sid": "DenyUnEncryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}
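With such a policy in place, every client writing to the bucket has to send that header. A minimal boto3 sketch (bucket and key names are hypothetical):

import boto3

s3 = boto3.client("s3")

# The ServerSideEncryption parameter adds the
# x-amz-server-side-encryption: AES256 header, which satisfies
# both Deny statements in the policy above.
s3.put_object(
    Bucket="YourBucket",
    Key="output/example.parquet",
    Body=b"...",
    ServerSideEncryption="AES256",
)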