The picture shows Pig and Hive on top of MapReduce. This is because MapReduce is the distributed computing engine that Pig and Hive use: Pig and Hive queries are executed as MapReduce jobs. Pig and Hive are easier to work with, since they provide a higher-level abstraction over MapReduce.
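To make that abstraction concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce phases that a simple "count rows per key" query is conceptually broken into. It is not Hadoop's, Pig's or Hive's API; the function names and sample data are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a file stored on the cluster.
    print(word_count(["Pig and Hive", "Hive on MapReduce"]))
```

With a higher-level tool, the same result would be expressed as a single grouping query, and the tool would take care of generating the phases.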
Now let's take a look at Spark/Storm/Flink on YARN in the picture. YARN is a cluster manager that allows various applications to run on top of it. Storm, Spark and Flink are all examples of applications that can run on top of YARN. MapReduce is also considered an application that can run on YARN, as shown in the diagram. YARN handles the resource management piece so that multiple applications can share the same cluster. (If you are interested in another example of a similar technology, check out Mesos.)
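The resource-sharing idea can be sketched with a toy model. The class below is a hypothetical illustration, not YARN's API: it tracks a single pool of memory and grants or refuses requests from named applications, which is the basic bookkeeping a cluster manager does on behalf of the applications running on it.

```python
class ToyClusterManager:
    """Hypothetical, simplified model of a shared-cluster resource manager."""

    def __init__(self, total_memory_mb):
        self.total_memory_mb = total_memory_mb
        self.allocations = {}  # application name -> memory granted (MB)

    def free_memory_mb(self):
        """Memory not currently granted to any application."""
        return self.total_memory_mb - sum(self.allocations.values())

    def request(self, app_name, memory_mb):
        """Grant the request only if enough memory is still free."""
        if memory_mb <= self.free_memory_mb():
            granted = self.allocations.get(app_name, 0)
            self.allocations[app_name] = granted + memory_mb
            return True
        return False

    def release(self, app_name):
        """Return all of an application's memory to the shared pool."""
        self.allocations.pop(app_name, None)


if __name__ == "__main__":
    manager = ToyClusterManager(total_memory_mb=8192)
    print(manager.request("spark", 4096))      # fits in the free pool
    print(manager.request("mapreduce", 2048))  # fits in what is left
    print(manager.request("flink", 4096))      # refused until memory is released
```

A real cluster manager also tracks CPU, schedules across many machines and applies queue policies; this sketch only shows why a single arbiter lets several applications coexist on one cluster.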
Finally, at the bottom of the picture is HDFS. This is the distributed storage layer that allows applications to store and access data. It provides features such as distributed storage, replication and fault tolerance.
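As a rough illustration of how replication leads to fault tolerance, the sketch below splits data into fixed-size blocks and places copies of each block on several nodes. It is a simplified model, not HDFS code: the helper names and node names are hypothetical, and the round-robin placement is an assumption made for brevity (real HDFS placement is rack-aware).

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
    print(placement)
    print(readable_after_failure(placement, "n3"))
```

Because every block lives on more than one node, losing a single node does not make any block unreadable, which is the essence of the fault tolerance described above.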
If you are interested in deeper dives, check out the Apache Projects page.