
There are a whole lot of Hadoop ecosystem pictures on the internet, and I struggle to get an understanding of how the tools work together.

E.g. in the attached picture, why are Pig and Hive based on MapReduce, whereas other tools like Spark or Storm sit on YARN?

Would you be so kind as to explain this?

Thanks! BR

hadoop ecosystem

Your question is not consistent. "Pig and Spark" vs "Spark and Storm"?!? Did you mean "Pig and Hive"? – Samson Scharfrichter
Read about Hadoop V1 (only MapReduce for resource allocation and execution logic) vs Hadoop V2 (YARN for resource allocation, multiple exec frameworks like MR, Tez, Spark-on-YARN, etc.) – Samson Scharfrichter
Pig and Hive is what I meant! I'm sorry for this mistake! – madtesa

1 Answer


The picture shows Pig and Hive on top of MapReduce because MapReduce is the distributed computing engine that they use: Pig and Hive queries are compiled into MapReduce jobs and executed by that engine. Working with Pig and Hive is easier because they give you a higher-level abstraction over raw MapReduce code.
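
To see what that abstraction buys you, here is a minimal sketch (my own illustration, not from the picture) of the classic word count written directly against the Hadoop MapReduce Java API; a short Pig Latin script or a HiveQL GROUP BY would compile down to a job of roughly this shape. The input/output paths are just placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Pig and Hive generate, optimize and submit jobs of this kind for you; you only write the query or script.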

Now let's take a look at Spark/Storm/Flink on YARN in the picture. YARN is a cluster manager that allows various applications to run on top of it. Storm, Spark and Flink are all examples of applications that can run on top of YARN, and MapReduce itself is just another such application, as shown in the diagram. YARN handles the resource management so that multiple applications can share the same cluster. (If you are interested in another example of a similar technology, check out Mesos.)
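
To make the "one cluster, many frameworks" idea concrete, here is a small sketch (again my own illustration, assuming a standard Hadoop 2.x client classpath and the usual yarn-site.xml/core-site.xml configuration) that uses the YARN Java client API to list whatever applications are currently running; a MapReduce job, a Spark job and a Tez DAG would all show up side by side.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    // Picks up yarn-site.xml / core-site.xml from the classpath.
    Configuration conf = new YarnConfiguration();

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Every framework submitted to the cluster shows up here,
    // whether it is a MapReduce job, a Spark job, a Tez DAG, ...
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.printf("%s  type=%s  state=%s  queue=%s%n",
          app.getApplicationId(),
          app.getApplicationType(),      // e.g. MAPREDUCE, SPARK, TEZ
          app.getYarnApplicationState(),
          app.getQueue());
    }

    yarnClient.stop();
  }
}
```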

Finally, at the bottom of the picture is HDFS. This is the distributed storage layer that applications use to store and access data; it spreads files across the cluster and provides replication and fault tolerance, so every engine above it can read and write the same data.
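
As a quick illustration of that layer (my own sketch, with a made-up path and the usual client configuration assumed), every engine above talks to HDFS through the same FileSystem API:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS from core-site.xml decides whether this talks to
    // HDFS or to the local file system.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hdfs-hello.txt"); // hypothetical path

    // Write: the client streams the data to DataNodes, which replicate it.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // The NameNode tracks metadata such as the replication factor.
    FileStatus status = fs.getFileStatus(path);
    System.out.println("replication = " + status.getReplication()
        + ", block size = " + status.getBlockSize());

    // Read the file back.
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```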

If you are interested in deeper dives, check out the Apache Projects page.