0
votes

Objective

Suppose you're building Data Lake and Star Schema with help of ETL. Storage format is Delta Lake. One of the ETL responsibilities is to build Slowly Changing Dimension (SCD) tables (cummulative state). This means that every day for every SCD table ETL reads full table's state, applies updates and saves them back (full overwrite).

Question

One of the questions we argued within my team: should we add time partition to SCD (full overwrite) tables? Means, should I save the latest (full) table state to SOME_DIMENSION/ or to SOME_DIMENSION/YEAR=2020/MONTH=12/DAY=04/?

Considerations

In one hand, Delta Lake has all required features: time-travel & ACID. When its overwritting the whole table, logical deletion happens, and you're still able to query old versions and rollback to them. So Delta Lake is almost managing time partition for you, the code get simpler.

In other hand, I said "almost" because IMHO time-travel & ACID don't cover 100% of use cases. It hasn't got a notion of arrival time. For example:

Example (when you need time partition)

BA team reported that SOME_FACT/YEAR=2019/MONTH=07/DAY=15 data are broken (facts must be stored with time partition any case, because data are processed by arrival time). In order to reproduce the issue on DEV/TEST environment you need 1 fact table raw inputs and 10 SCD tables.

With facts everything is simple, because you have raw inputs in Data Lake. But with incremental state (SCD tables) things get complex - how to get the state of 10 SCD tables for the point in time when SOME_FACT/YEAR=2019/MONTH=07/DAY=15 was processed? How to do this automatically?

To complicate the things even more, your environment may come through bunch of bugfixes and history re-processings. Means 2019-07 data may be reprocessed somewhere in 2020. And Delta Lake allow you to rollback only based on processing or version number. So you actually don't know which version you should use.

In other hand, with date partitioning, you are always sure that SOME_FACT/YEAR=2019/MONTH=07/DAY=15 was calculated over SOME_DIMENSION/YEAR=2019/MONTH=07/DAY=15.

1

1 Answers

1
votes

It depends, and I think it's a bit more complicated.

Some context first - Delta gives you time travel only limited to the current commit history, which is by default 30 days. If you are doing optimizations, that time might be significantly shorter (default 7 days). Also, you actually can query Delta tables as of specific time, not only version, but due to above limitations (unless you are willing to pay the performance and financial cost of storing really long commit history), it's not useful from long-term perspective.

This is why a very common data lake architecture right now is medallion tables approach (Bronze->Silver->Gold). Ideally, I'd want to store the raw inputs in the 'bronze' layer, have a whole historical perspective in the silver layer (already clean, validated, best source of truth, but with whole history as needed), and consume the current version directly from "golden" tables.

This would avoid increasing the complexity of querying the SCDs due to additional partitions, while giving you the option to "go back" to silver layer if need arises. But it's always a tradeoff decision - in any case, don't rely on Delta for long-term versioning.