Please clarify my understanding of Hadoop/HBase

Question

I have been reading white papers and watching youtube videos for half the day now and believe I have a proper understanding of the technology, but before I start my project I want to make sure its right.

So with that, here's what I think I know.

As i'm understanding the architecture of hadoop and hbase, they pretty much model out like this

-----------------------------------------
|               Mapreduce               |
-----------------------------------------
| Hadoop  | <-- hbase export--|  HBase  |
|         |  --apache pig --> |         |
-----------------------------------------
|       HDFS                            |
 ----------------------------------------

In a nutshell HBase is a completely different DB engine tuned for real time updates and queries that happens to run on the HDFS and is compatible with Mapreduce.

Now, assuming the above is correct, here is what else I think I know.

Hadoop is designed for big data from start to finish. The engine uses a distributed append only system which means you can not delete data once its inserted. To access the data you can use Mapreduce, or the HDFS shell and HDFS API..
Hadoop does not like small chunks and it was never intended to be a real time system. You would not want to store a single person and address per file, you would in fact store a million people and addresses per file and insert the large file.
HBase on the other hand is a pretty typical NoSql database engine that in spirit compares to CouchDB, RavenDB, etc. The notable difference is its built using the HDFS from hadoop allowing it to scale reliably to sizes only limited by your wallet.
Hadoop is a collection of File System (HDFS) and Java APIs to perform computation on HDFS. HBase is a NoSql database engine that uses HDFS to efficiently store data across a cluster
To build a Mapreduce job to access data from both Hadoop and HBase, one would be best off to use HBase export to push the HBase data into Hadoop and write your job to process the data, but Mapreduce can access both systems one at a time.
You must be very careful when designing your HBase files as HBase does not natively support indexing fields within that file, HBase only indexes the primary key. Many tips and tricks help work around this fact.

Ok, so if im still accurate to this point, this would be a valid use case.

You build the site with HBase. You use HBase the same as you would any other NoSql or RDBMS to build out your functionality. Once thats done, you put your metrics logging points in the code to record your metrics in say, log4j. You create a new appender in log4j with rules that say when the log file reaches 1 gig in size, push it to the hadoop cluster, delete it, create a new file, go on with life.

Later, a Mapreduce developer can write a routine that uses HBase export to grab a data set from HBase, say a list of user ID's, then go to the logs that are stored in Hadoop and find the bread crumb trail for each user thru the system for a given timespan.

Ok, with that all said, now for the specific question. Are statements 1 - 6 accurate?

**********Edit one, i have updated my beliefs above based on the answers received.

ericson ericson · Accepted Answer · 2013-02-15T05:19:20

You can access the file in HDFS directly via HDFS shell or HDFS API.
Correct.
I am not familiar with CouchDB or RavenDB, but in HBase you can not have secondary-index, so you must carefully design your row key to speed up your query. There are a lot of HBase schema design tips on the internet you can google for.
I think it is more appropriate to say Hadoop is a computing engine to a database engine. If you want to import HDFS data to HBase, you can use Apache Pig as stated in this post. If you want to export HBase data to HDFS, you can use the export utility.
MapReduce is a component of Hadoop framework and it does not sit on top of HBase. You can access HBase data in a MapReduce job because of HBase uses HDFS for its storage. I don't think you want to access the HFile directly from a MapReduce job because the raw file is encoded in a special format, it is not easy to parse and it might change in future releases.

Please clarify my understanding of Hadoop/HBase

2 Answers