2
votes

I am trying to run Spark ML algorithms in an environment that doesn't contain Hadoop at all.

I haven't been able to figure out from tutorials and other posts whether this is possible:
Can I run Spark without any version of Hadoop and without HDFS? Or do I need to install Hadoop in order to run Spark?

When I run the Spark shell, I get the following error:

C:\spark-2.2.0-bin-without-hadoop\bin>spark-shell
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
        at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
        at org.apache.spark.deploy.SparkSubmitArguments$$anonfun$mergeDefaultSparkProperties$1.apply(SparkSubmitArguments.scala:124)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.deploy.SparkSubmitArguments.mergeDefaultSparkProperties(SparkSubmitArguments.scala:124)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:110)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 7 more

Below is my sample program:

package com.example.spark_example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;


public class Main {


  public static void main(String[] args) {
    String logFile = "C:\\spark-2.2.0-bin-without-hadoop\\README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter((Function<String, Boolean>) s -> s.contains("a")).count();

    long numBs = logData.filter((Function<String, Boolean>) s -> s.contains("b")).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

    sc.stop();
  }

}

Running it causes the following exception:

17/08/10 15:23:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/10 15:23:35 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

2 Answers

2
votes

Can I run Spark without using any version of Hadoop

You cannot. While Spark doesn't require a Hadoop cluster (YARN, HDFS), it does depend on the Hadoop libraries. If you don't have a Hadoop installation that provides them, use one of the complete builds described as "pre-built for Apache Hadoop". In your case:

spark-2.2.0-bin-hadoop2.7
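
To confirm whether the Hadoop libraries are visible to the JVM, a small check like the following may help. It is only a diagnostic sketch: the class name `HadoopClasspathCheck` and the helper `isOnClasspath` are made up for illustration, and the only thing taken from the question is the class named in the stack trace (`org.apache.hadoop.fs.FSDataInputStream`).

```java
// Diagnostic sketch (hypothetical helper, not part of Spark or Hadoop).
final class HadoopClasspathCheck {

    // The class that the NoClassDefFoundError in the question complains about.
    static final String HADOOP_CLASS = "org.apache.hadoop.fs.FSDataInputStream";

    // Returns true if the named class can be located by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only look the class up, do not run its static initializers.
            Class.forName(className, false, HadoopClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isOnClasspath(HADOOP_CLASS)) {
            System.out.println("Hadoop client libraries found on the classpath.");
        } else {
            System.out.println("Hadoop client libraries are missing from the classpath.");
        }
    }
}
```

Run with the same classpath as your application. If it reports the libraries as missing, you are most likely using the "without-hadoop" build without a separate Hadoop installation.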
1
votes

If you downloaded Apache Spark as a pre-built package, you already have all the libraries you need. To resolve your issue, you need to install winutils, the Windows support binaries for Hadoop.

Just copy all of the winutils files into this folder:

%SPARK_HOME%\bin

Then add an environment variable HADOOP_HOME with the value %SPARK_HOME%.
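
The `null\bin\winutils.exe` in the error message suggests that Hadoop could not determine its home directory. Besides the HADOOP_HOME environment variable, Hadoop also reads the `hadoop.home.dir` Java system property, so it can be set from code before the Spark context is created. The following is a minimal sketch under that assumption: `WinutilsSetup`, `winutilsPath`, and `configureHadoopHome` are hypothetical names, not part of any Spark or Hadoop API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: verifies that winutils.exe is where Hadoop expects it
// and, if so, points Hadoop at that directory via a system property.
final class WinutilsSetup {

    // Hadoop looks for <hadoop home>\bin\winutils.exe on Windows.
    static Path winutilsPath(String hadoopHome) {
        return Paths.get(hadoopHome, "bin", "winutils.exe");
    }

    // Sets hadoop.home.dir only when winutils.exe exists under the given directory.
    // Returns true if the property was set, false otherwise.
    static boolean configureHadoopHome(String hadoopHome) {
        if (hadoopHome == null || !Files.isRegularFile(winutilsPath(hadoopHome))) {
            return false;
        }
        System.setProperty("hadoop.home.dir", hadoopHome);
        return true;
    }

    public static void main(String[] args) {
        // Use the first argument if given, otherwise fall back to HADOOP_HOME.
        String home = args.length > 0 ? args[0] : System.getenv("HADOOP_HOME");
        if (configureHadoopHome(home)) {
            System.out.println("hadoop.home.dir set to " + home);
        } else {
            System.out.println("winutils.exe not found under: " + home);
        }
    }
}
```

In the sample program from the question, a call such as `WinutilsSetup.configureHadoopHome(...)` would have to run before `new JavaSparkContext(conf)`, because the lookup happens when the Hadoop classes are first loaded.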