Hive Hbase JOIN performance & KUDU

Question

I have been reading the Cloudera documentation on using Impala to join a Hive table against smaller HBase tables. My situation: there is no Big Data appliance such as OBDA, and the HBase dimension table is largish and mutable. The documentation states:

If you have join queries that do aggregation operations on large fact tables and join the results against small dimension tables, consider using Impala for the fact tables and HBase for the dimension tables. (Because Impala does a full scan on the HBase table in this case, rather than doing single-row HBase lookups based on the join column, only use this technique where the HBase table is small enough that doing a full table scan does not cause a performance bottleneck for the query.)

Is there any way to get that single key look up in another way?

In addition, I noted the following on KUDU and HDFS (presumably HIVE). Does anybody have experience here? Keen to know. I will be trying it myself in due course, but installing parcels on non-parcelled quickstarts is not so easy...

Mix and match storage managers within a single application (or query)

• SELECT COUNT(*) FROM my_fact_table_on_hdfs JOIN my_dim_table_in_kudu ON ...

"Is there any way to get that single key look up in another way" > you mean, read the huge dataset once to extract the list of keys, then retrieve the relevant records from HBase (with multiple GETs in a loop), then read again the huge dataset to perform the lookup? That would be incredibly inefficient, don't you think? — Samson Scharfrichter
I am not making any assumptions about what is best, but I have been a VLDB ORACLE DBA working on performance and tuning, which is a little different of course. In BIG DATA, what is a small table? Your response leads me to the KUDU option. — thebluephantom
HBase is basically a key/value DB, designed for random access and no transactions. Hive is a batch query engine built on top of HDFS (a distributed file system for immutable, large files) and YARN (a resource manager for distributed batch jobs). Hive also has a "connector" to run Full Scans on HBase, but there is a SERIOUS impedance mismatch here... — Samson Scharfrichter
On the other hand, Phoenix attempts to bring some RDBMS features -- primitive data types, table schemas, indexing, transactions -- on top of HBase. And Kudu attempts to bring some RDBMS features -- atomic Insert-Update-Deletes -- as an alternative to HDFS+YARN, but it's a Cloudera initiative, oriented towards Impala and Spark (not Hive...!) — Samson Scharfrichter
Note also that Kudu is still immature, has no serious authentication/authorization/auditing features yet, no serious documentation (even when you are a Cloudera paying customer). — Samson Scharfrichter

thebluephantom · Accepted Answer · 2017-06-07T17:12:35

Erring on the side of caution, joining against KUDU for the dimensions would be the way to go, so as to avoid a scan of a large dimension in HBASE when only a lookup is required.

I am retracting the latter point: I am sure that a JOIN will not cause an HBASE scan if it is an equijoin.

That said, IMPALA allows an MPP approach, without MapReduce, to JOINing dimensions with fact tables. The advantage of the OBDA is less obvious now, IMO.
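
To make the trade-off from the question and comments concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions: plain dicts and lists stand in for the dimension store and the fact table, the function names are hypothetical, and nothing here talks to Impala, HBase, or Kudu. It compares how many dimension rows are touched by a full-scan hash join versus key-by-key lookups.

```python
# Toy model only: a dict stands in for a key/value dimension store and a
# list of tuples for the fact table. All names are hypothetical.

def join_with_full_scan(fact_rows, dim_store):
    """Hash join: read every dimension row once, then probe in memory."""
    dim_rows_read = 0
    hash_table = {}
    for key, value in dim_store.items():  # full scan of the dimension
        hash_table[key] = value
        dim_rows_read += 1
    joined = [(key, amount, hash_table[key])
              for key, amount in fact_rows if key in hash_table]
    return joined, dim_rows_read


def join_with_point_lookups(fact_rows, dim_store):
    """Collect the distinct join keys first, then fetch only those keys."""
    distinct_keys = {key for key, _ in fact_rows}  # first pass over the facts
    fetched = {}
    for key in distinct_keys:  # one GET-style lookup per distinct key
        if key in dim_store:
            fetched[key] = dim_store[key]
    joined = [(key, amount, fetched[key])  # second pass over the facts
              for key, amount in fact_rows if key in fetched]
    return joined, len(distinct_keys)


dim_store = {i: f"customer_{i}" for i in range(1000)}
fact_rows = [(3, 10.0), (7, 5.5), (3, 2.0), (999, 1.0)]

scan_result, scan_reads = join_with_full_scan(fact_rows, dim_store)
lookup_result, lookup_reads = join_with_point_lookups(fact_rows, dim_store)
print(scan_reads, lookup_reads)
```

Both strategies return the same joined rows. With only a few distinct keys in the facts, the lookups touch far fewer dimension rows than the scan. The model ignores per-request overhead and the cost of reading the fact table twice, which is the inefficiency raised in the comments once the number of distinct keys grows.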
