5
votes

I am somewhat new to Apache Hadoop. I have seen this and this question about Hadoop, HBase, Pig, Hive and HDFS. Both of them compare the above technologies.

However, I have seen that a Hadoop environment typically contains all of these components (HDFS, HBase, Pig, Hive, Azkaban).

Can someone explain the relationship between these components/technologies and their responsibilities inside a Hadoop environment, as an architectural workflow, preferably with an example?


2 Answers

7
votes

General Overview:

HDFS is Hadoop's Distributed File System. Intuitively you can think of this as a filesystem that spans many servers.
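
Below is a minimal sketch of what using HDFS looks like in practice: copying a local file into the distributed filesystem and reading it back. It assumes the hdfs command-line client is installed and a cluster (or pseudo-distributed setup) is running; the file names and paths are made up for illustration.

    import subprocess

    # Copy a local file into HDFS.
    subprocess.run(["hdfs", "dfs", "-put", "access.log", "/logs/access.log"], check=True)

    # List the directory to confirm the file is there.
    subprocess.run(["hdfs", "dfs", "-ls", "/logs"], check=True)

    # Stream the file's contents back to stdout.
    subprocess.run(["hdfs", "dfs", "-cat", "/logs/access.log"], check=True)

From your code's point of view the file now lives in one big filesystem, even though its blocks are replicated across many servers.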

HBase is a column-oriented datastore. It is modeled after Google's Bigtable, but if that isn't something you're familiar with, think of it as a non-relational database that provides real-time read/write access to data. It is integrated into Hadoop.
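
As a rough sketch of that real-time read/write access, here is what a single-row write and read could look like with the third-party happybase Python client, assuming HBase's Thrift gateway is running on hbase-host; the table, row key, and column names are invented for illustration.

    import happybase

    connection = happybase.Connection('hbase-host')
    table = connection.table('users')

    # Write one row: data is keyed by a row key and grouped into column families.
    table.put(b'user#1001', {b'info:name': b'Alice', b'info:city': b'Oslo'})

    # Read it back immediately -- no batch job required.
    row = table.row(b'user#1001')
    print(row[b'info:name'])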

Pig and Hive are ways of querying data in the Hadoop ecosystem. The main difference is that Hive is closer to SQL, while Pig uses its own language, called Pig Latin.
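
To make the difference concrete, here is a hedged sketch of the same aggregation asked both ways. The Hive side uses the third-party PyHive client against HiveServer2; the host, table, and column names are assumptions, and the Pig Latin equivalent is only shown as a comment.

    from pyhive import hive

    # Hive: looks and feels like SQL.
    cursor = hive.connect(host='hive-server', port=10000).cursor()
    cursor.execute("SELECT city, COUNT(*) FROM users GROUP BY city")
    print(cursor.fetchall())

    # Pig would express the same idea as a step-by-step data flow in Pig Latin:
    #   users   = LOAD 'users' USING PigStorage(',') AS (name:chararray, city:chararray);
    #   grouped = GROUP users BY city;
    #   counts  = FOREACH grouped GENERATE group, COUNT(users);
    #   DUMP counts;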

Azkaban is a prison, I mean a batch workflow job scheduler. It is similar to Oozie in that you can run MapReduce, Pig, Hive, bash, etc. as a single job.
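
To give a feel for the scheduling side, here is a loose sketch of how an Azkaban flow is commonly described: one small properties file per job, with dependencies wiring them into a pipeline. The job names, commands, and paths are made up; check the Azkaban documentation for the exact properties your version supports.

    # Two Azkaban-style job files: "report" runs only after "ingest" succeeds.
    ingest_job = ("type=command\n"
                  "command=hdfs dfs -put /data/raw/access.log /logs/access.log\n")

    report_job = ("type=command\n"
                  "command=hive -f /jobs/daily_report.hql\n"
                  "dependencies=ingest\n")

    with open("ingest.job", "w") as f:
        f.write(ingest_job)
    with open("report.job", "w") as f:
        f.write(report_job)

You would then zip these files, upload them as an Azkaban project, and schedule the flow from the web UI -- roughly the role cron plus shell scripts play elsewhere.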

At the highest level, you can think of HDFS as your filesystem and HBase as your datastore. Pig and Hive are your means of querying the datastore, and Azkaban is your way of scheduling jobs.

Stretched Example:

Suppose you're familiar with Linux ext3 or ext4 as a filesystem, MySQL/PostgreSQL/MariaDB/etc. as a database, SQL to access the data, and cron to schedule jobs. (On Windows, swap ext3/ext4 for NTFS and cron for Task Scheduler.)

In the Hadoop world, HDFS takes the place of ext3 or ext4 (and is distributed), HBase takes the database role (and is non-relational!), Pig/Hive are a way to access the data, and Azkaban is a way to schedule jobs.

NOTE: This is not an apples-to-apples comparison. It's just meant to demonstrate that the Hadoop components form an abstraction over a workflow you're probably already familiar with.

I highly encourage you to look into the components further, as you'll have a good amount of fun. Hadoop has so many interchangeable components (YARN, Kafka, Oozie, Ambari, ZooKeeper, Sqoop, Spark, etc.) that you'll be asking yourself this question a lot.

EDIT: The links you posted go into more detail about HBase and Hive/Pig, so I tried to give an intuitive picture of how they all fit together.

2
votes

A Hadoop environment contains all of these components (HDFS, HBase, Pig, Hive, Azkaban). A short description of each:

HDFS - the storage layer of the Hadoop framework.

HBase - a columnar database, where data is stored column by column for faster access. Yes, it uses HDFS as its storage.

Pig - a data-flow language; its community provides built-in functions to load and process semi-structured data such as JSON and XML, along with structured data.
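
As a small, hedged illustration of Pig as a data-flow language, the sketch below writes a Pig Latin script and runs it in local mode, assuming the pig command-line tool is installed; the input file and field names are invented (for JSON there is also a built-in JsonLoader).

    import subprocess

    script = (
        "logs    = LOAD 'access.csv' USING PigStorage(',') AS (username:chararray, bytes_sent:int);\n"
        "big     = FILTER logs BY bytes_sent > 1000;\n"
        "grouped = GROUP big BY username;\n"
        "totals  = FOREACH grouped GENERATE group AS username, SUM(big.bytes_sent) AS total_bytes;\n"
        "DUMP totals;\n"
    )

    with open("totals.pig", "w") as f:
        f.write(script)

    # -x local runs against the local filesystem instead of HDFS, handy for testing.
    subprocess.run(["pig", "-x", "local", "totals.pig"], check=True)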

Hive - a query language for running SQL-like queries over tables; you first have to mount (define) a table over your HDFS data before you can work with it.
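
A hedged sketch of that table-mounting step: Hive does not have to own the files, you overlay a table definition on data already sitting in HDFS and then query it with SQL-like statements. This uses the third-party PyHive client; the server host, HDFS path, table, and column names are assumptions.

    from pyhive import hive

    cursor = hive.connect(host='hive-server', port=10000).cursor()

    # Point a table definition at an existing HDFS directory (no data is copied).
    cursor.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
            username STRING,
            bytes_sent INT
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/logs/access'
    """)

    # Ordinary queries now run over those HDFS files.
    cursor.execute("SELECT username, SUM(bytes_sent) FROM access_logs GROUP BY username")
    print(cursor.fetchall())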

Azkaban - if you have a pipeline of Hadoop jobs, you can schedule them to run at specific times and with dependencies between them.