2
votes

I am trying to read a csv file in a s3 bucket from spark using IAM Roles but getting a NoClassDefFoundError on MultiObjectDeleteException

I have installed Spark 2.4.4 without hadoop and installed hadoop 3.2.1 along with hadoop-aws-3.2.1.jar and aws-java-sdk-1.11.655.jar. I had to install a version of spark without hadoop because the hadoop jars that are part of the spark build is 2.7.3 which is from 2016.

sc.hadoopConfiguration.set("fs.s3a.credentialsType", "AssumeRole")
sc.hadoopConfiguration.set("fs.s3a.assumed.role.arn", "arn:aws:iam::[ROLE]")
val myRDD = sc.textFile("s3a://test_bucket/names.csv")
myRDD.count()

My IAM Policy that is attached to the role has the following

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutAccountPublicAccessBlock",
                "s3:GetAccountPublicAccessBlock",
                "s3:ListAllMyBuckets",
                "s3:ListJobs",
                "s3:CreateJob",
                "s3:HeadBucket"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::test_bucket"
        }
    ]
}

I have even tried sc.hadoopConfiguration.set("fs.s3a.multiobjectdelete.enable", "false") but same error as below:

java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2575)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2540)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2636)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:239)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:325)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.s3.model.MultiObjectDeleteException
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 76 more
2

2 Answers

1
votes

The above issue was related to the IAM policy. It didn't have the policy to view the file "/*" is needed.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::test_bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::test_bucket/*"
            ]
        }
    ]
}

The Role that you create will have the above IAM Policy. The Role will be attached to the EC2 instance (master and slaves AWS EC2 instance) which is critical because Spark will resume the role of what is assigned to the EC2 instance. Therefore because EC2 is assigned the role, you do not need to specify the role in the scala code. All you need to do is write the following Scala code to read a file which will resume the role of what is assigned to the EC2 instance.

val myRDD = sc.textFile("s3a://test_bucket/test.csv")
myRDD.count()

The hadoop-3.2.1.tar.gz has both the hadoop-aws-3.2.1.jar and aws-java-sdk-bundle-1.11.375.jar located in /opt/hadoop/share/hadoop/tools/lib

This is where you want to make sure that you have defined a spark-env.sh that points to the correct jar directories so spark loads the jars in the classpath.

cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh

export SPARK_DIST_CLASSPATH=/opt/spark/jars:/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/contrib/capacity-scheduler/*.jar:/opt/hadoop/share/hadoop/tools/lib/*
0
votes

There's no option fs.s3a.credentialsType; for s3a everything is in lower case, which helps debug these things,

The docs on assumed role credentials cover what permissions are needed https://hadoop.apache.org/docs/r3.1.0/hadoop-aws/tools/hadoop-aws/assumed_roles.html

The way this works on hadoop 3.2 is that the call that must have full permissions and then the s3a connector calls STS AssumeRole to create some short lived session credentials in the given role. In EC2 the VM's do not have permission to call AssumeRole (they are already running in a role) so you have to go with what ever the VM's were created with.

For now, use the s3a assumed role stuff to see what role policies permit.