2 votes

I installed the package spark-2.0.2-bin-without-hadoop.tgz on a local DEV box, but it fails to run, as shown below:

$ ./bin/spark-shell
NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

$ ./sbin/start-master.sh
NoClassDefFoundError: org/slf4j/Logger

Did I misinterpret the statement below, which says Spark can run without Hadoop?

"Do I need Hadoop to run Spark? No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode."


2 Answers

6 votes

For the first issue, concerning FSDataInputStream: as noted in this Stack Overflow answer (https://stackoverflow.com/a/31331528),

the "without Hadoop" is a bit misleading in that this build of Spark is not tied to a specific build of Hadoop as opposed to not running without it. To run Spark using the "without Hadoop" version, you should bind it to your own Hadoop distribution.

For the second issue, concerning the missing SLF4J classes: as noted in this Stack Overflow answer (https://stackoverflow.com/a/39277696), you can include the SLF4J jar yourself, or, if you already have a Hadoop distribution installed, it already provides these classes and things should run as-is.
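If you would rather satisfy the SLF4J dependency directly, dropping the SLF4J jars into Spark's jars/ directory should also work, since the daemon scripts include that directory on their launch classpath. A rough sketch, with placeholder paths and versions:

$ cp /path/to/slf4j-api-1.7.x.jar /path/to/slf4j-log4j12-1.7.x.jar jars/
$ ./sbin/start-master.sh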

That said, you can download Apache Spark pre-built with Hadoop and still not use Hadoop itself. That package contains all the necessary jars, and you can point Spark at the local file system, e.g. by using a file:// URI when accessing your data (instead of HDFS).
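For example, with the pre-built-with-Hadoop package unpacked, something like the following should work out of the box (the file path is just a placeholder):

$ ./bin/spark-shell
scala> spark.read.textFile("file:///tmp/sample.txt").count()   // reads from the local filesystem, no HDFS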

1 vote

Yes. From the Spark downloads page, as of today (for Spark 3.1.1), the following package types are available:

  1. Pre-built for Apache Hadoop 2.7

This version (spark-3.1.1-bin-hadoop2.7.tgz) of Spark runs with Hadoop 2.7.

  2. Pre-built for Apache Hadoop 3.2 and later

This version (spark-3.1.1-bin-hadoop3.2.tgz) of Spark runs with Hadoop 3.2 and later.

  3. Pre-built with user-provided Apache Hadoop

This version (spark-3.1.1-bin-without-hadoop.tgz) of Spark runs with any user-provided version of Hadoop.

From the name of the last package (spark-3.1.1-bin-without-hadoop.tgz), it might appear that Hadoop is needed only for that version (i.e., 3.) and not for the others (i.e., 1. and 2.). However, the naming is ambiguous: Hadoop is needed only if we want to support HDFS and YARN. In standalone mode, Spark can run in a truly distributed setting (or with its daemons on a single machine) without Hadoop, as sketched below.
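A rough sketch of standalone mode on a single machine (script names as shipped with Spark 3.1; replace <master-host> with your hostname):

$ ./sbin/start-master.sh                                   # logs a spark://<master-host>:7077 URL
$ ./sbin/start-worker.sh spark://<master-host>:7077        # register a worker with that master
$ ./bin/spark-shell --master spark://<master-host>:7077    # run against the standalone cluster, no Hadoop involved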

For 1. and 2., you can run Spark without a Hadoop installation, since the core Hadoop libraries come bundled with the pre-built binaries (hence spark-shell works without throwing any exceptions); for 3., Spark will not work unless a Hadoop installation is provided, because that package ships without the Hadoop runtime.

In essence,

  • we will need to install Hadoop separately in all three cases (1., 2., and 3.) if we want to support HDFS and YARN
  • if we don't want to install Hadoop, we can use a Spark package pre-built with Hadoop (1. or 2.) and run it in standalone mode
  • if we want to pick our own version of Hadoop, then 3. should be used together with a separate Hadoop installation

For more information, refer to this passage from the docs:

There are two variants of Spark binary distributions you can download. One is pre-built with a certain version of Apache Hadoop; this Spark distribution contains built-in Hadoop runtime, so we call it with-hadoop Spark distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution doesn’t contain a built-in Hadoop runtime, it’s smaller, but users have to provide a Hadoop installation separately. We call this variant no-hadoop Spark distribution. For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn’s classpath into Spark ...
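The setting the docs refer to there is, if I recall the name correctly, spark.yarn.populateHadoopClasspath; with the with-hadoop distribution you can turn it back on at submit time if you actually want YARN's Hadoop jars on the classpath (the jar path below is just the example jar shipped in the 3.1.1 package):

$ ./bin/spark-submit \
    --master yarn \
    --conf spark.yarn.populateHadoopClasspath=true \
    --class org.apache.spark.examples.SparkPi \
    examples/jars/spark-examples_2.12-3.1.1.jar 100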