I am trying to read a native CSV file from an S3 bucket using Spark with Scala, running locally. I am able to read the file using the http protocol, but I intend to use the s3a protocol.
Below is the configuration I set up before the call:

    spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "Mykey")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "Mysecretkey")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
    spark.sparkContext.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "eu-west-1.amazonaws.com")
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl.disable.cache", "true")
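
For reference, the read call that then fails is roughly the following (the bucket and path here are placeholders, not my real ones):

    // "my-bucket" and the object key below are placeholders
    val df = spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/data/input.csv")
    df.show()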

I am getting the below exception:

    Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException:
    Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)

My versions:

Spark: 2.3.1
Scala: 2.11
aws-java-sdk: 1.11.336
hadoop-aws: 2.8.4

2 Answers

It's the exception you get when the S3 SDK library is missing; more detail can be found at https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html

Basically, when you see a ClassNotFoundException, it is caused by a binary file missing from your JVM classpath: either the root classloader loads classes from the Java runtime directory and your application's present directory, or an external classloader loads them from some given path. Check those paths carefully. Maybe you need to read more documentation about ClassLoader; google it :)
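
As a quick check you can probe the classpath directly from your driver code or the Spark shell. This is only a diagnostic sketch, using the class name from your stack trace:

    // Diagnostic sketch: if this throws, the hadoop-aws JAR (which contains
    // S3AFileSystem) is missing from the JVM classpath, and Spark will fail
    // the same way when it resolves fs.s3a.impl.
    try {
      Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
      println("hadoop-aws is on the classpath")
    } catch {
      case _: ClassNotFoundException =>
        println("hadoop-aws is missing from the classpath")
    }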

Important: Classpath setup

  1. The S3A connector is implemented in the hadoop-aws JAR. If it is not on the classpath: stack trace.
  2. Do not attempt to mix a hadoop-aws version with other Hadoop artifacts from different versions. They must be from exactly the same release. Otherwise: stack trace.
  3. The S3A connector depends on the AWS SDK JARs. If they are not on the classpath: stack trace.
  4. Do not attempt to use an Amazon S3 SDK JAR different from the one the Hadoop version was built with. Otherwise: a stack trace is highly likely.
  5. The normative list of dependencies of a specific version of the hadoop-aws JAR is stored in Maven and can be viewed on mvnrepository; an illustrative sbt fragment is sketched below the link.

https://cwiki.apache.org/confluence/display/HADOOP2/AmazonS3
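
As an illustrative sketch only (versions are the ones from the question, not a verified combination), a build definition that keeps the artifacts consistent might look like:

    // Illustrative build.sbt fragment: take the exact aws-java-sdk version
    // hadoop-aws 2.8.4 was built against from its POM on mvnrepository
    // rather than picking it independently.
    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-sql"  % "2.3.1",
      // must be from exactly the same release line as the Hadoop JARs Spark
      // is already using; it pulls in a matching aws-java-sdk transitively
      "org.apache.hadoop" %  "hadoop-aws" % "2.8.4"
    )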