4 votes

Suppose, hypothetically, that I have a star schema in a data warehouse setting. There is one VERY, VERY long fact table (think billions to trillions of rows) and several low-cardinality dimension tables (think 100 dimension tables). Each fact table foreign key that points to a dimension table primary key is bitmap indexed. Each dimension table primary key is also bitmap indexed. This is all for fast joins. All pretty standard.

Suppose that the data warehouse is starting to show performance degradation. The time it takes to return results from a bitmapped join gets worse the longer the fact table gets. The business requirement is that the fact table keeps growing (we cannot move data older than a year off to archival storage).

I'm thinking of the following solutions:

  1. Hash partition the fact table, but this just holds off the inevitable growth issue temporarily.
  2. Partition the physical star schema database into multiple schemas/databases: 1..N fact tables and their dimension copies, each holding the data assigned to it by a hash(1..N) function that is executed in a separate ETL staging database to determine which database/schema a fact row (resulting from the ETL process) will go into. If any dimension changes, replicate the change to the corresponding dimension in the other databases (a minimal routing sketch appears after this list). Again, this will not work as a permanent solution.
  3. Collapse the dimensions and store all dimension values directly in the fact table. Then import the fact table into HBase on Hadoop. You get one massive HBase table -- a key/value store with no dimension tables. I would do this because joins are cost-prohibitive in HBase (so no fact-to-dimension joins; just enforce the allowed dimension values on the dimension columns).
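To make solutions #1/#2 concrete, here is a minimal sketch of the hash routing I have in mind in the ETL staging area. The bucket count, key column, and schema naming are all made up for illustration; note that a plain modulo hash like this forces a full remap whenever N changes, which is exactly the scaling worry I raise further down.

```python
# Sketch: route an ETL-produced fact row to one of N physical schemas/databases.
# N_PARTITIONS, the business key, and the schema naming are illustrative assumptions.
import hashlib

N_PARTITIONS = 8  # number of physical schemas/databases

def target_schema(fact_business_key: str) -> str:
    """Pick the schema a fact row lands in, based on a stable hash of its key."""
    digest = hashlib.md5(fact_business_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % N_PARTITIONS   # plain modulo hash: changing N remaps everything
    return f"dw_star_{bucket:02d}"            # e.g. dw_star_03

# Example: decide where one fact row goes during the ETL staging step.
row = {"order_id": "ORD-20130115-000042", "amount": 19.99}
print(target_schema(row["order_id"]))
```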

Has anyone ever done this before?

Does anyone have any tips for solution #3?

Is the HBase solution a good way to scale up while keeping reads fast?

As far as writes go, I don't need them to be fast, since they would be done off-hours as batch processes.

If anyone has chosen solution 1 or 2, did you use a consistent hashing algorithm (to avoid the full remapping that a plain modulo hash requires when partitions/hash keys are added dynamically)? Dynamic growth in the number of partitions without a full remap is probably not an option (I haven't seen it done in practice for partitioned tables), so it seems to me that any partitioning solution would eventually run into the same scaling problems.
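For reference, this is the kind of consistent hashing I mean -- a minimal pure-Python sketch with hypothetical node names and virtual-node count. Adding a node only remaps the keys on the ring segment it takes over, rather than everything, which is the property a plain modulo hash lacks:

```python
# Sketch: a consistent hashing ring. Node names and vnode count are illustrative.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (ring position, node)
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=64):
        # Each node gets many virtual points on the ring to spread load evenly.
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def get_node(self, key: str):
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["dw_star_00", "dw_star_01", "dw_star_02"])
print(ring.get_node("ORD-20130115-000042"))  # only nearby keys move if a node is added later
```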

Any thoughts, advice, or experience with moving a giant fact table with many dimensions (a traditional DW star schema) to one giant, dimensionless HBase table?

Related question:

How do aggregate data collections that traditionally reside in materialized views (or, alternately, in separate fact tables linked to the same dimensions as the most granular fact table -- e.g. hourly/daily/weekly/monthly rollups where the base fact table is hourly) map across to HBase?

[Diagram: aggregate fact tables in a partial star schema]

My thought is that, since there are no materialized views in HBase, the aggregate data collections would be stored as HBase tables that get updated/inserted into any time there is a change to the most granular, lowest-level fact table.

Any thoughts on aggregate tables in HBase? Has anyone used Hive scripts to essentially mimic the behavior of materialized views, updating aggregate column data in secondary HBase tables (e.g. daily_aggregates_fact_table, weekly_aggregates_fact_table, monthly_aggregates_fact_table) whenever the most granular fact table changes?
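To frame the question, this is roughly the kind of Hive script I have in mind for mimicking a materialized view -- a sketch only, driven from Python via PyHive, with hypothetical table/column names (fact_hourly, daily_aggregates_fact_table, etc.) and rebuilding just the affected day after a batch load:

```python
# Sketch: re-derive one day's aggregate rows after a batch load into the hourly fact table.
# Table/column names, partitioning, and connection details are illustrative assumptions.
from pyhive import hive

DAILY_ROLLUP = """
INSERT OVERWRITE TABLE daily_aggregates_fact_table PARTITION (load_date='{d}')
SELECT dim_product_key,
       dim_store_key,
       SUM(sales_amount) AS sales_amount,
       COUNT(*)          AS row_count
FROM   fact_hourly
WHERE  load_date = '{d}'
GROUP BY dim_product_key, dim_store_key
"""

def rebuild_daily_aggregate(load_date: str) -> None:
    """Overwrite a single day's partition of the aggregate table from the base fact table."""
    conn = hive.connect(host="hive-gateway", port=10000)  # hypothetical Hive host
    try:
        conn.cursor().execute(DAILY_ROLLUP.format(d=load_date))
    finally:
        conn.close()

rebuild_daily_aggregate("2013-01-15")
```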

Why do you consider HBase and not Hive? – David Gruzman
What do you want to do with the HBase table? How are you going to access it? What sort of questions will you want to answer? There's no "general way" to do it, since HBase does not provide a general SQL-like mechanism to query it. – Arnon Rotem-Gal-Oz
Check out SO -- there is a bunch of advice on setting up this kind of database as a partitioned Hive table. You'll be thankful for its (nearly) SQL capability. – Olaf
That's all fine, but note that HiveQL will result in map/reduce jobs, not instant results. – Arnon Rotem-Gal-Oz
How frequently is your fact data updated? Do updates happen all over the place, or are they concentrated in the data for the few latest days? In the cases that are more typical of data warehouse applications, Hive table partitioning can go a long way. – Olaf

2 Answers

1 vote

The dimensions are encoded in the row key in HBase. The value is your measure value. If your fact table is factless, the value in the HBase row can be null.

Based on the (admittedly sparse) resources available on the Internet, I think the idea is:

**RowKey**                              **Value**
DimensionA                              XX
DimensionA:DimensionB                   XX
DimensionB:DimensionC                   XX
DimensionA:DimensionB:DimensionC        XXX
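If it helps, here is a minimal sketch of writing and reading rows keyed that way, using the happybase Python client; the table name, column family "m", and dimension values are all hypothetical:

```python
# Sketch: composite "DimensionA:DimensionB:DimensionC" row keys holding a measure value.
# Table name, column family, and dimension values are illustrative assumptions.
import happybase

conn = happybase.Connection("hbase-thrift-host")   # Thrift gateway, hypothetical host
table = conn.table("sales_fact")

# Write: the dimensions form the row key, the measure is the cell value.
row_key = b"2013-01-15:store_042:sku_9981"         # date : store : product
table.put(row_key, {b"m:sales_amount": b"19.99"})

# Read back a single composite key.
print(table.row(row_key))
```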

Is it suitable for your problems?

0 votes

HBase is not a good choice for a general-purpose data warehouse (with real-time-ish query times). Any single table will only allow you to drill down along one dimension, or along one path through dimensions (if you design the composite key right). It isn't undoable (e.g. eBay built their new search engine on HBase), but it doesn't come out of the box.
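For example, with a composite key like date:store:product, drilling down along that one path is a cheap prefix scan, but slicing by product across all stores cuts against the key order and means a full scan (or a second, differently keyed table). A minimal happybase sketch, with hypothetical names:

```python
# Sketch: drilling down along the one path the composite row key supports.
import happybase

table = happybase.Connection("hbase-thrift-host").table("sales_fact")

# Cheap: all rows for 2013-01-15 at store_042 share a key prefix.
for key, data in table.scan(row_prefix=b"2013-01-15:store_042:"):
    print(key, data)

# Expensive: "all stores that sold sku_9981" does not share a prefix and would
# require a full table scan (or a second table keyed product-first).
```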

There are several efforts to provide high-performance SQL over Hadoop (e.g. Hadapt or Rainstor), but they won't give you the performance of good massively parallel databases like Vertica, Greenplum, Asterdata, Netezza and the like.