Objective
Suppose you're building Data Lake and Star Schema with help of ETL. Storage format is Delta Lake. One of the ETL responsibilities is to build Slowly Changing Dimension (SCD) tables (cummulative state). This means that every day for every SCD table ETL reads full table's state, applies updates and saves them back (full overwrite).
Question
One of the questions we argued within my team: should we add time partition to SCD (full overwrite) tables? Means, should I save the latest (full) table state to SOME_DIMENSION/
or to SOME_DIMENSION/YEAR=2020/MONTH=12/DAY=04/
?
Considerations
In one hand, Delta Lake has all required features: time-travel & ACID. When its overwritting the whole table, logical deletion happens, and you're still able to query old versions and rollback to them. So Delta Lake is almost managing time partition for you, the code get simpler.
In other hand, I said "almost" because IMHO time-travel & ACID don't cover 100% of use cases. It hasn't got a notion of arrival time. For example:
Example (when you need time partition)
BA team reported that SOME_FACT/YEAR=2019/MONTH=07/DAY=15
data are broken (facts must be stored with time partition any case, because data are processed by arrival time). In order to reproduce the issue on DEV/TEST environment you need 1 fact table raw inputs and 10 SCD tables.
With facts everything is simple, because you have raw inputs in Data Lake. But with incremental state (SCD tables) things get complex - how to get the state of 10 SCD tables for the point in time when SOME_FACT/YEAR=2019/MONTH=07/DAY=15
was processed? How to do this automatically?
To complicate the things even more, your environment may come through bunch of bugfixes and history re-processings. Means 2019-07 data may be reprocessed somewhere in 2020. And Delta Lake allow you to rollback only based on processing or version number. So you actually don't know which version you should use.
In other hand, with date partitioning, you are always sure that SOME_FACT/YEAR=2019/MONTH=07/DAY=15
was calculated over SOME_DIMENSION/YEAR=2019/MONTH=07/DAY=15
.