4
votes

I run Apache Spark (2.11, 1.5.2) on a local machine using input files stored in AWS S3. If the files are stored in a bucket in the Ireland region (eu-west-1) it works fine.

But if I try to read files stored in an S3 bucket located in Frankfurt (eu-central-1) it fails with the error message:

The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256

How can I use AWS4-HMAC-SHA256?

The detailed error message is:

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/%2myfolder' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidRequest</Code><Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message><RequestId>ECB53FECECD1C910</RequestId><HostId>BmEyVcO/eHZR3IO2Z+8IkEWOn189IBGb2YAgbDxhTu+abuyORCEjHyC14l6nIRVNNnQL2Nyya9I=</HostId></Error>
    at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:174)
    at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:214)
...
Caused by: org.jets3t.service.S3ServiceException: S3 GET failed for '/%2myfolder' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidRequest</Code><Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message><RequestId>ECB53FECECD1C910</RequestId><HostId>BmEyVcO/eHZR3IO2Z+8IkEWOn189IBGb2YAgbDxhTu+abuyORCEjHyC14l6nIRVNNnQL2Nyya9I=</HostId></Error>
    at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:416)
    at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestGet(RestS3Service.java:752)

The code is:

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class S3Problem {
  public static void main(String[] args) {
    String s3Folder = "s3n://mybucket/myfolder";

    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");

    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> myData = sc.textFile(s3Folder).cache();
    long count = myData.count();

    System.out.println("Line count: " + count);
  }
}

AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are provided as environment variables.

3

3 Answers

9
votes

Putting together Ewan and windsource's answers into a working (at least for me) complete script for PySpark:

import findspark
findspark.init()
import pyspark

spark = pyspark.sql.SparkSession.builder \
    .master("local[*]") \
    .appName("Spark") \
    .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true") \
    .getOrCreate()

# Set the property for the driver. Doesn't work using the same syntax 
# as the executor because the jvm has already been created.
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", "***")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", "8080")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "***")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "***")

test = spark.sparkContext.textFile('s3a://my-bucket/test')
print(test.take(5))
5
votes

Your path is set to s3://, I think it should be s3n://

Try changing that, along with using these authentication parameters:

val hadoopConf=sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId","key")
hadoopConf.set("fs.s3n.awsSecretAccessKey","secret") 

Alternatively you could try using s3a:// but you'll have to include the hadoop-aws and aws-java-sdk jar files into your CLASSPATH.

0
votes

Frankfurt region is using auth v4, an easier way to use s3n impl for s3 path, is set something like this in core-site.xml (use s3n for both s3 and s3n path)

<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>

but, you should really think about upgrade s3 impl to s3a, it is using aws sdk.

you need to put hadoop-aws jar and aws-java-sdk jar (and third-party jars in its package) into your CLASSPATH.

hadoop-aws: http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/

aws-java-sdk: http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/

then in core-site.xml

<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>