Yes. As of today, the downloads page of Spark offers the following package types for Spark 3.1.1 (a fetch-and-unpack sketch follows the list):
1. Pre-built for Apache Hadoop 2.7
   This package (spark-3.1.1-bin-hadoop2.7.tgz) bundles a Hadoop 2.7 runtime.
2. Pre-built for Apache Hadoop 3.2 and later
   This package (spark-3.1.1-bin-hadoop3.2.tgz) bundles a Hadoop 3.2 runtime and works with Hadoop 3.2 and later.
3. Pre-built with user-provided Apache Hadoop
   This package (spark-3.1.1-bin-without-hadoop.tgz) bundles no Hadoop runtime and runs against any user-provided version of Hadoop.
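If it helps, here is a minimal fetch-and-unpack sketch for one of these packages. The archive.apache.org mirror URL is my assumption; the downloads page generates a mirror link for you:

```bash
# Minimal sketch: download and unpack the with-hadoop (Hadoop 3.2) package.
# The mirror URL is an assumption; use the link the downloads page gives you.
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
tar -xzf spark-3.1.1-bin-hadoop3.2.tgz
cd spark-3.1.1-bin-hadoop3.2
```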
From the name of the last package (spark-3.1.1-bin-without-hadoop.tgz), it appears that we need Hadoop for 3. and not for the other two (1. and 2.). The naming is ambiguous, though: we need Hadoop only if we want to support HDFS and YARN. In Standalone mode, Spark can run in a truly distributed setting (or with all daemons on a single machine) without Hadoop, as the sketch below shows.
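As a rough sketch of that Standalone mode (paths assume the unpacked with-hadoop package from above; on a real cluster you would run the worker script on each machine):

```bash
# Sketch: a one-machine Standalone "cluster" with no Hadoop installed anywhere.
./sbin/start-master.sh                               # master web UI on http://localhost:8080
./sbin/start-worker.sh spark://$(hostname):7077      # register a worker with the master
./bin/spark-shell --master spark://$(hostname):7077  # run a shell against the cluster
```

(On releases before Spark 3.1, the worker script is named start-slave.sh.)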
For 1. and 2., you can run Spark without a Hadoop installation, because the core Hadoop libraries come bundled with the pre-built binary, so spark-shell starts without throwing any exceptions. For 3., Spark will not work unless a Hadoop installation is provided, since that package ships without a Hadoop runtime; see the spark-env.sh sketch below.
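For 3., the way to provide that Hadoop installation is described in the docs (see "Using Spark's 'Hadoop Free' Build"): point SPARK_DIST_CLASSPATH at Hadoop's classpath, typically in conf/spark-env.sh:

```bash
# conf/spark-env.sh for the without-hadoop package -- assumes the 'hadoop'
# launcher of your separately installed Hadoop is already on PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```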
In essence,
- we need to install Hadoop separately in all three cases (1., 2., and 3.) if we want to support HDFS and YARN (a spark-submit-on-YARN sketch follows this list)
- if we don't want to install Hadoop, we can use one of the with-hadoop packages (1. or 2.) and run Spark in Standalone mode
- if we want to pair Spark with an arbitrary version of Hadoop, we should use 3. together with a separate Hadoop installation
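For the first bullet, once that separate Hadoop installation exists, running on YARN mostly means telling Spark where the cluster's client configuration lives. A sketch (the config path is an assumption, adjust to your installation):

```bash
# Sketch: submit the bundled SparkPi example to YARN.
# HADOOP_CONF_DIR must point at your cluster's client-side Hadoop config;
# /etc/hadoop/conf is just a common default.
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.12-3.1.1.jar 100
```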
For more information, refer to this excerpt from the docs:
There are two variants of Spark binary distributions you can download. One is pre-built with a certain version of Apache Hadoop; this Spark distribution contains built-in Hadoop runtime, so we call it with-hadoop Spark distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution doesn’t contain a built-in Hadoop runtime, it’s smaller, but users have to provide a Hadoop installation separately. We call this variant no-hadoop Spark distribution. For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn’s classpath into Spark ...
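So if a job running from a with-hadoop package does need YARN's jars, the default behaviour the docs describe can be overridden per job. A sketch using the spark.yarn.populateHadoopClasspath setting that the same docs page introduces:

```bash
# Sketch: ask the with-hadoop build to populate YARN's classpath anyway,
# e.g. when the job needs jars that only the cluster ships
./bin/spark-shell --master yarn \
  --conf spark.yarn.populateHadoopClasspath=true
```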