7 votes

I'm trying to access an S3 file from a SparkSQL job. I've already tried solutions from several posts, but nothing seems to work. Maybe it's because my EC2 cluster runs the new Spark 2.0 for Hadoop 2.7.

I set up the Hadoop configuration this way:

sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", secretKey)

I build an uber-jar with sbt assembly using:

name := "test"
version := "0.2.0"
scalaVersion := "2.11.8"

libraryDependencies += "com.amazonaws" % "aws-java-sdk" %   "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll(
    ExclusionRule("com.amazonaws", "aws-java-sdk"),
    ExclusionRule("commons-beanutils")
)

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"

When I submit my job to the cluster, I always get the following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 172.31.7.246): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1726)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:662)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:446)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:476)

It seems that the driver is able to read from S3 without a problem, but the workers/executors are not... I do not understand why my uber-jar is not sufficient.

I also tried, without success, to configure spark-submit using:

--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3

PS: If I switch to the s3n protocol, I get the following exception:

java.io.IOException: No FileSystem for scheme: s3n

Are you accessing the S3 file from any transformation or action? – Sandeep Purohit
Yes, I call native C functions (through JNI) within map and reduce operations. In fact, the library containing those functions was uploaded to an S3 bucket, and it needs to be loaded by every worker node before running the native functions. So I add the library to the Spark context using sc.addFile(s3://pathToLibrary.so) and then load it using System.load(SparkFiles.get(Library.so)) (see the sketch after these comments). – elldekaa
You can copy the aws-sdk and hadoop-aws JARs to the jars folder inside Spark; that should solve the ClassNotFoundException issue. – rdllopes
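For reference, here is a minimal sketch of the addFile/SparkFiles pattern described in the comment above; the bucket, library name, and the per-record call are placeholders, and sc is assumed to be the existing SparkContext:

import org.apache.spark.SparkFiles

// Ship the native library to every executor (bucket and library name are hypothetical).
sc.addFile("s3a://my-bucket/libnative.so")

val data = sc.parallelize(Seq("record-1", "record-2"))  // stand-in for the real input

val result = data.mapPartitions { iter =>
  // SparkFiles.get resolves the executor-local copy that addFile downloaded.
  System.load(SparkFiles.get("libnative.so"))
  iter.map(record => record)                            // the JNI call on each record would go here
}
result.collect()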

2 Answers

1 vote

Actually, all Spark operations run on the workers, while you set this configuration only on the master. You could try applying the S3 configuration inside mapPartitions { } instead; a sketch follows below.
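For illustration, a minimal sketch of that suggestion, assuming accessKey and secretKey are String vals defined on the driver and that the bucket name and paths are placeholders (fs.s3a.access.key and fs.s3a.secret.key are the property names the S3A connector reads):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val paths = sc.parallelize(Seq("s3a://my-bucket/some/object.txt"))  // hypothetical paths

val exists = paths.mapPartitions { iter =>
  // Build the Hadoop configuration inside the task so every executor gets the S3 settings.
  val conf = new Configuration()
  conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  conf.set("fs.s3a.access.key", accessKey)   // accessKey/secretKey are captured from the driver closure
  conf.set("fs.s3a.secret.key", secretKey)
  iter.map { p =>
    val fs = FileSystem.get(new URI(p), conf)
    fs.exists(new Path(p))
  }
}

Note that this only moves where the configuration is set; the hadoop-aws and aws-java-sdk classes still need to be on the executor classpath, as the second answer explains.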

8 votes

If you want to use s3n:

sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)

Now, regarding the exception: you need to make sure both JARs are on the driver and worker classpaths, and, if you're running in client mode, distribute them to the worker nodes via the --jars flag:

spark-submit \
--conf "spark.driver.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--jars /location/to/aws-java-sdk.jar,/location/to/hadoop-aws.jar \

Also, if you're building your uber-JAR and including aws-java-sdk and hadoop-aws, there's no reason to also use the --packages flag.