Apache Hadoop (2.0) provides two major components: (1) HDFS, the Hadoop Distributed File System, for storing data (i.e. files) on a cluster, and (2) YARN, a system for managing the cluster's compute resources (i.e. CPUs/RAM).
Hadoop 2.0:
- Storage Management: HDFS
- Compute Resource Management: YARN
Hadoop (2.0) also provides an execution engine called MapReduce (MR2, short for MapReduce2) that uses YARN compute resources to execute MapReduce-based programs.
Prior to Hadoop (2.0), YARN did not exist, and MapReduce performed both roles: resource management and execution engine. Hadoop (2.0) decoupled compute resource management from execution engines, allowing you to run many types of applications on a Hadoop cluster.
- When people state that Spark is better than Hadoop, they are typically referring to the MapReduce execution engine.
- When people state that Spark can run on Hadoop (2.0), they are typically referring to Spark using YARN compute resources.
A few examples of execution engines on Hadoop 2.0:
- YARN resources used to run MapReduce2 (MR2)
- YARN resources used to run Spark
- YARN resources used to run Tez
Spark programs need compute resources to run. These typically come either from a Spark standalone cluster or from YARN on a Hadoop cluster. There are other ways to run Spark, but they are not relevant to this discussion.
Like MapReduce programs, Spark programs can also access data stored in HDFS or in other places.
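To make the two resource options concrete, here is a minimal sketch of how the same Spark program might be submitted to a standalone cluster versus to YARN. It only builds and prints the `spark-submit` command lines; it does not launch anything. The `--master` and `--deploy-mode` flags are real `spark-submit` options, but the helper `spark_submit_command`, the host names, the port numbers, the script name `wordcount.py`, and the HDFS path are all assumptions made up for illustration.

```python
import shlex


def spark_submit_command(master, app, app_args=(), deploy_mode=None):
    """Build a spark-submit argument list (hypothetical helper, illustration only)."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        cmd += ["--deploy-mode", deploy_mode]
    cmd.append(app)
    cmd.extend(app_args)
    return cmd


# Hypothetical application script and input location; in both cases the
# input is a file stored in HDFS.
APP = "wordcount.py"
INPUT = "hdfs://namenode:8020/data/input.txt"

# Option 1: resources come from a Spark standalone cluster.
# "master-host:7077" is an assumed address for the standalone master.
standalone_cmd = spark_submit_command("spark://master-host:7077", APP, [INPUT])

# Option 2: resources come from YARN on a Hadoop cluster.
# With "--master yarn", Spark locates the cluster through the Hadoop
# configuration (HADOOP_CONF_DIR or YARN_CONF_DIR), not through a URL.
yarn_cmd = spark_submit_command("yarn", APP, [INPUT], deploy_mode="cluster")

print(shlex.join(standalone_cmd))
print(shlex.join(yarn_cmd))
```

Under these assumptions, only the `--master` value (and optionally the deploy mode) differs between the two commands. The program itself and the HDFS input path stay the same, which reflects the point above: where the compute resources come from is independent of where the data is stored.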