I'm trying to access S3 file from SparkSQL job. I already tried solutions from several posts but nothing seems to work. Maybe because my EC2 cluster runs the new Spark2.0 for Hadoop2.7.
I setup hadoop this way:
sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", secretKey)
I build an uber-jar using sbt assembly using:
name := "test"
version := "0.2.0"
scalaVersion := "2.11.8"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll(
ExclusionRule("com.amazonaws", "aws-java-sdk"),
ExclusionRule("commons-beanutils")
)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
When I submit my job to the cluster, I always got the following errors:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 172.31.7.246): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1726) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:662) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:446) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:476)
It seems that the driver is able to read from S3 without problem but not the workers/executors... I do not understand why my uberjar is not sufficient.
However, I tried as well without success to configure spark-submit using:
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
PS: If I switch to s3n protocol, I got the following exception:
java.io.IOException: No FileSystem for scheme: s3n
sc.addFile(s3://pathToLibrary.so)
and then I load it usingSystem.load(SparkFiles.get(Library.so))
. – elldekaa