How is Spark different from Hadoop?

Question

I am trying to learn the Spark framework. Its homepage, https://spark.apache.org/, says that it is better than the Hadoop framework. But then it says that Spark runs on Hadoop. I really don't understand how it can run on Hadoop when it is supposed to be better than Hadoop.

Can someone explain to me the hierarchy between those two?

Andrew Mo · Accepted Answer · 2017-10-22T18:58:56

Apache Hadoop (2.0) provides two major components: (1) HDFS, the Hadoop Distributed File System, for storing data (i.e. files) on a cluster, and (2) YARN, a cluster compute resource management system (i.e. for CPUs/RAM).

Hadoop 2.0:

Storage Management: HDFS
Compute Resource Management: YARN

Hadoop (2.0) also provides an execution engine called MapReduce (MR2, MapReduce2) that can use YARN compute resources to execute MapReduce-based programs.

Prior to Hadoop (2.0), YARN did not exist, and MapReduce performed both roles: resource management and execution engine. Hadoop (2.0) decoupled compute resource management from execution engines, allowing you to run many types of applications on a Hadoop cluster.

When people state that Spark is better than Hadoop, they are typically referring to the MapReduce execution engine.
When people state that Spark can run on Hadoop (2.0), they are typically referring to Spark using YARN compute resources.

A few Hadoop 2.0 Execution Engine Examples:

YARN Resources used to run MapReduce2 (MR2)
YARN Resources used to run Spark
YARN Resources used to run Tez

Spark programs need resources to run. These typically come either from a Spark standalone cluster or from YARN on a Hadoop cluster. There are other ways to run Spark, but they are not necessary for this discussion.
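To make the two resource options concrete, here is a minimal sketch in Python that only builds the `spark-submit` command lines; it does not launch anything. The `--master` and `--deploy-mode` flags are real `spark-submit` options, but the script name `my_job.py`, the host `spark-master.example.com`, and the helper `spark_submit_command` are made up for illustration.

```python
import shlex


def spark_submit_command(app, master, deploy_mode="client"):
    """Build a spark-submit argument list for the given cluster manager."""
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app]


# Resources come from YARN on a Hadoop cluster.
on_yarn = spark_submit_command("my_job.py", "yarn")

# Resources come from a Spark standalone cluster (hypothetical host name).
on_standalone = spark_submit_command(
    "my_job.py", "spark://spark-master.example.com:7077"
)

# Print the commands as they would be typed in a shell.
print(shlex.join(on_yarn))
print(shlex.join(on_standalone))
```

The program being submitted is the same in both cases; only the `--master` value changes. When the master is `yarn`, Spark finds the cluster through the Hadoop configuration directory (the `HADOOP_CONF_DIR` or `YARN_CONF_DIR` environment variable) rather than through a URL.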

Like MapReduce programs, Spark programs can also access data stored in HDFS or in other places.
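As a rough illustration of that last point, the PySpark sketch below counts the lines of a text file with the same code whether the file lives in HDFS or on the local filesystem. It assumes PySpark is installed and a cluster is configured; both file paths and the names `count_lines` and `main` are hypothetical.

```python
def count_lines(spark, path):
    """Count the lines of the text file at `path` using a SparkSession."""
    # spark.read.text yields one row per line of the input file.
    return spark.read.text(path).count()


def main():
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("line-count").getOrCreate()
    try:
        # Hypothetical paths: one in HDFS, one on the local filesystem.
        print(count_lines(spark, "hdfs:///data/events.txt"))
        print(count_lines(spark, "file:///tmp/events.txt"))
    finally:
        spark.stop()
```

Calling `main()` from a script submitted with `spark-submit` would run both counts. Only the URI scheme differs between the two reads, which is why the storage layer (HDFS) and the execution engine (Spark) can be chosen independently.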
