Suppose, hypothetically, that I have a star schema in a data warehouse setting. There is one VERY, VERY long fact table (think billions to trillions of rows) and several low-cardinality dimension tables (think 100 dimension tables). Each fact table foreign key that points to a dimension table primary key is bitmap indexed, as is each dimension table primary key. This is all for fast joins. All pretty standard.
Suppose the data warehouse is starting to show performance degradation: the time it takes to return results from a bitmapped join gets worse the longer the fact table grows. The business requirement is that the fact table keeps growing (we cannot move data older than a year off to archival storage).
I'm thinking of the following solutions:
- Hash partition the fact table, but this just holds off the inevitable growth issue temporarily.
- Partition the physical star schema across multiple schemas/databases: 1..N fact tables and their dimension copies, each holding the data assigned to it by a hash(1..N) function executed in a separate ETL staging database, which determines which database/schema each fact row (produced by the ETL process) goes into. If any dimension changes, replicate the change to the corresponding dimension in the other databases. Again, this will not work as a permanent solution (a sketch of the routing I have in mind follows the list).
- Collapse the dimensions and store all dimension values directly in the fact table, then import the fact table into HBase on Hadoop. You get a massive HBase table, a key-value store with no dimension tables. I would do this because joins are cost-prohibitive in HBase (so no fact-to-dimension joins; just enforce dimension values on the dimension columns). A row-key sketch of what I'm picturing also follows the list.
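To make the routing in #1/#2 concrete, here is a minimal sketch of the plain modulo hash I'd run in the ETL staging database to pick a target schema/database (the key name and partition count are made up for illustration):

```python
import hashlib

NUM_PARTITIONS = 8  # hypothetical number of target schemas/databases


def target_partition(business_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route a fact row to one of N schemas/databases by hashing its business key."""
    digest = hashlib.md5(business_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


# In the ETL staging step, each outgoing fact row gets a target database/schema:
print(target_partition("order-123456"))  # some value in 0..7
```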
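And for #3, roughly how I picture the denormalized layout: the dimension values live inline in the fact row, and the row key is a composite that supports the range scans I need. The column families, key layout, and names here are purely illustrative, not a settled design:

```python
def make_row_key(event_ts: str, store_id: str, product_id: str) -> bytes:
    """Composite HBase-style row key: time bucket + high-selectivity dimensions.

    Leading with a time bucket keeps range scans over recent data cheap;
    salting/reversing the prefix may be needed to avoid region hot-spotting.
    """
    return f"{event_ts[:10]}|{store_id}|{product_id}|{event_ts}".encode("utf-8")


# One denormalized fact "row": dimension values stored inline, so no joins are needed.
fact_row = {
    b"d:store_name": b"Store 42",     # collapsed dimension attributes
    b"d:product_name": b"Widget",
    b"m:qty": b"3",                   # measures
    b"m:amount": b"19.99",
}
print(make_row_key("2012-06-01T13:45:00", "42", "SKU-9"))
```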
Has anyone ever done this before?
Does anyone have any tips for solution #3?
Is the HBase solution optimal as far as scaling up with fast reads?
As far as writes go, I don't need them to be fast, as they would be done off-hours as batch processes.
If anyone has chosen solution #1 or #2, did you use a consistent hashing algorithm (to avoid the full remapping that a plain old modulo hash requires when more partitions/hash keys are created dynamically)? Dynamic growth in the number of partitions without a full remap is probably not an option (I haven't seen it done in practice as far as partitioned tables are concerned), so it seems to me that any partitioning solution would lead to scaling problems.
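To clarify what I mean by consistent hashing, a minimal ring sketch (the partition names are made up): adding a partition only remaps roughly 1/N of the keys, instead of the near-total reshuffle a plain modulo hash would force.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring: adding a node remaps only ~1/N of the keys."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas   # virtual nodes per physical partition
        self._ring = []            # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def get_node(self, key: str) -> str:
        # First ring position at or after the key's hash, wrapping around the ring.
        h = self._hash(key)
        idx = bisect.bisect_right(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]


ring = ConsistentHashRing(["fact_db_1", "fact_db_2", "fact_db_3"])
print(ring.get_node("order-123456"))
ring.add_node("fact_db_4")   # only ~1/4 of keys move to the new partition
print(ring.get_node("order-123456"))
```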
Any thoughts, advice, or experience with moving a giant fact table with many dimensions (a traditional DW star schema) to a giant, dimensionless HBase table?
Related question:
How do aggregate data collections that traditionally reside in materialized views (or alternatively as separate fact tables linked to the same dimensions as the most granular fact table -- e.g. hourly/daily/weekly/monthly where the base fact table is hourly) in a data warehouse map across to HBase?
My thought is that since there are no materialized views in HBase, the aggregate data collections are stored as HBase tables that get updated/inserted into any time there is a change to the most granular, lowest-level fact table.
Any thoughts on aggregate tables in HBase? Has anyone used Hive scripts to essentially mimic the behavior of materialized views, updating aggregate column data in secondary HBase tables (e.g. daily_aggregates_fact_table, weekly_aggregates_fact_table, monthly_aggregates_fact_table) on a change to the most granular fact table?
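To show what I'm picturing for those aggregate tables, a minimal sketch with plain Python dicts standing in for the HBase rollup tables (the key layout and metric are made up): every change to the granular hourly fact fans a delta out to daily/weekly/monthly rows keyed by the truncated time bucket plus the same dimension values.

```python
from collections import defaultdict
from datetime import datetime

# Stand-ins for the daily/weekly/monthly aggregate HBase tables (row key -> metric total).
daily_aggregates_fact_table = defaultdict(float)
weekly_aggregates_fact_table = defaultdict(float)
monthly_aggregates_fact_table = defaultdict(float)


def apply_fact_delta(event_ts: str, store_id: str, product_id: str, amount: float) -> None:
    """On a change to the granular (hourly) fact, fan the delta out to each rollup grain."""
    ts = datetime.fromisoformat(event_ts)
    dims = f"{store_id}|{product_id}"
    iso_year, iso_week, _ = ts.isocalendar()

    daily_aggregates_fact_table[f"{ts:%Y-%m-%d}|{dims}"] += amount
    weekly_aggregates_fact_table[f"{iso_year}-W{iso_week:02d}|{dims}"] += amount
    monthly_aggregates_fact_table[f"{ts:%Y-%m}|{dims}"] += amount


apply_fact_delta("2012-06-01T13:45:00", "42", "SKU-9", 19.99)
print(dict(daily_aggregates_fact_table))
```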