9
votes

I know that HBASE is a columnar database that stores structured data of tables into HDFS by column instead of by row. I know that Spark can read/write from HDFS and that there is some HBASE-connector for Spark that can now also read-write HBASE tables.

Questions:

1) What are the added capabilities brought by layering Spark on top of HBASE instead of using HBASE solely? It depends only on programmer capabilities or is there any performance reason to do that? Are there things Spark can do and HBASE solely can't do?

2) Stemming from previous question, when you should add HBASE between HDFS and SPARK instead of using directly HDFS?

3
for question number 2, what was your final decision and what was the reason for it?Jae Park
to cherry-pick specific records instead of doing massive batch processingJohan
sorry which means? u chose..?Jae Park
The previous comments answers "when you should add HBASE between HDFS and SPARK instead of using directly HDFS?"Johan

3 Answers

6
votes

1) What are the added capabilities brought by layering Spark on top of HBASE instead of using HBASE solely? It depends only on programmer capabilities or is there any performance reason to do that? Are there things Spark can do and HBASE solely can't do?

At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine and spark provides a competent execution engine on top of HBase (Intermediate results, Relational Algebra, etc.). HBase is a MVCC storage structure and Spark is an execution engine. They are natural complements to one another.

2) Stemming from previous question, when you should add HBASE between HDFS and SPARK instead of using directly HDFS?

Small reads, concurrent write/read patterns, incremental updates (most etl)

Good luck...

5
votes

I'd say that using distributed computing engines like Apache Hadoop or Apache Spark imply basically a full scan of any data source. That's the whole point of processing the data all at once.

HBase is good at cherry-picking particular records, while HDFS certainly much more performant with full scans.

When you do a write to HBase from Hadoop or Spark, you won't write it to database is usual - it's hugely slow! Instead, you want to write the data to HFiles directly and then bulk import them into.

The reason people invent SQL databases is because HDDs were very very slow at that time. It took the most clever people tens of years to invent different kind of indexes to clever use the bottleneck resource (disk). Now people try to invent NoSQL - we like associative arrays and we need them be distributed (that's what essentially what NoSQL is) - they're very simple and very convenient. But in todays world with SSDs being cheap no one needs databases - file system is good enough in most cases. The one thing, though, is that it has to be distributed to keep up the distributed computations.

Answering original questions:

  1. These are two different tools for completely different problems.

  2. I think if you use Apache Spark for data analysis, you have to avoid HBase (Cassandra or any other database). They can be useful to keep aggregated data to build reports or picking specific records about users or items, but that's happen after the processing.

4
votes

Hbase is a No SQL data base that works well to fetch your data in a fast fashion. Though it is a db, it used large number of Hfile(similar to HDFS files) to store your data and a low latency acces.

So use Hbase when it suits a requirement that your data needs to accessed by other big data.

Spark on the other hand, is the in-memory distributed computing engine which have connectivity to hdfs, hbase, hive, postgreSQL,json files,parquet files etc. There is no considerable performance change while reading from a HDFS file or Hbase upto some gbs. After that Hbase connectivity is becoming faster....