
I am new to Apache Spark, Scala and Hadoop tools.

I have set up a new local single-node Hadoop cluster as mentioned here, and have also set up Spark with a reference to this Hadoop environment as mentioned here.

I am able to verify that spark-shell and the Spark UI are up and running. I am also able to browse HDFS through localhost.

To go a step further, I uploaded a sample file to HDFS and verified that it is visible in the Hadoop web UI on localhost.
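For reference, the upload can be reproduced from the command line with something like the following (the target directory below is only an example):

# target path is only an example
hdfs dfs -mkdir -p /user/hadoopuser
hdfs dfs -put README.md /user/hadoopuser/
hdfs dfs -ls /user/hadoopuser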

Now I try to count the lines in the file using both Java and spark-shell (Scala), but both invocations fail with this stack trace:

Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/InputSplitWithLocationInfo
at org.apache.spark.rdd.HadoopRDD.getPreferredLocations(HadoopRDD.scala:329)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:274)
at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:274)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:273)
... removed ...

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 32 more

Java code (I am running it with spark-submit, passing the jar that contains this code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public static void main(final String... args) {
    SparkConf conf = new SparkConf().setAppName("hello spark");

    JavaSparkContext ctx = new JavaSparkContext(conf);

    // Read the file (resolved against the default filesystem, HDFS in this setup)
    JavaRDD<String> textload = ctx.textFile("README.md");

    System.out.println(textload.count());
}
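For completeness, this is roughly how I am submitting it (the class and jar names below are placeholders for my actual build output):

# class and jar names are placeholders
spark-submit --class com.example.HelloSpark target/hello-spark-1.0.jar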

pom.xml dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.2.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.4.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapred</artifactId>
        <version>0.22.0</version>
    </dependency>
</dependencies>

Scala code run on the command line through spark-shell:

sc.textFile("README.md").count

Version Details
Hadoop 2.4.0
Scala 2.11.8
Java 1.8
Apache Spark 2.2.1

What am I missing here?


1 Answer


"I am new to Apache Spark, Scala and Hadoop"

Then you should be using the latest, stable versions of each. For starters, download the latest Spark that includes Hadoop.

hadoop-mapred is a deprecated package, and you should not be mixing two different versions of the Hadoop libraries. That mismatch explains why you are getting the ClassNotFoundException.

If you downloaded Spark from the second link, it bundles a Hadoop version greater than 2.4, and those libraries are already on the Spark classpath, so you should not add them to your POM at all. See the Java quickstart for a reference POM.
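As a rough sketch, assuming you keep Spark 2.2.1 built for Scala 2.11 from that download, the dependencies section could shrink to just spark-core, marked as provided since the distribution already ships it along with matching Hadoop client jars:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.2.1</version>
        <!-- provided: the spark-submit runtime supplies Spark and its Hadoop jars -->
        <scope>provided</scope>
    </dependency>
</dependencies>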

I'll also point out that you should actually get HDFS working before you try to run Spark against it (assuming you need to use Hadoop instead of standalone Spark).

But you do not need Hadoop at all to run sc.textFile("README.md").count from the Spark shell.
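For example, something like this should work locally without any Hadoop setup (the file path is resolved relative to the directory where spark-shell was started):

$ spark-shell --master local[*]
scala> sc.textFile("README.md").count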