The Tables:
The company I work for has a Slowly Changing Dimension (employee data) which has been warehoused using the Kimball Method. The dimension table containing this data has a Primary Key (int identity employee_key
, used as a surrogate in other tables), a Natural Key (employee_id
), valid date ranges (valid_date
and invalid_date
) and a variety of SCD1 and SCD2 data elements tracked over time. Here's a simplified example:
employee_key | employee_id | valid_date | invalid_date | employee_name | employee_role
1 | 1001 | 1/1/2015 | 6/1/2015 | Bob | DBA
2 | 1001 | 6/2/2015 | NULL | Bob | Developer
3 | 1002 | 1/1/2015 | NULL | Jill | DBA
In the above example, employee_key
is the primary key (surrogate) and employee_id
is the natural key. The other values should hopefully be self explanatory. This table reflects that:
- Bob was a DBA beginning on 1/1/2015 and ending on 6/1/2015.
- Bob was a Developer beginning on 6/2/2015 and is currently in that role.
- Jill was a DBA beginning on 1/1/2015 and is currently in that role.
Now, we also have a slew of fact tables that reference this dimension. One such fact table contains all the time logged by employees and is granular to the day. We aren't really concerned with the structure of these tables, just that they link to our employee dimension using the surrogate key and they generally contain a lot of rows (between 10M-200M). Here's an example of the fact table containing time logged:
calendar_dt | employee_id | employee_key | time_code | hours
1/1/2015 | 1001 | 1 | 1234 | 2.25
1/1/2015 | 1001 | 1 | 21 | 3.50
1/2/2015 | 1001 | 1 | 21 | 8.00
...
6/1/2015 | 1001 | 1 | 21 | 4.00
Linking to the employee dimension by surrogate key employee_key
serves an important business purpose - it enables accurate historical reporting without an expensive join using the BETWEEN
operator. For instance, it lets us say that time logged by Bob on 6/1/2015 is attributable to his DBA role, and time logged by Bob on 6/2/2015 is attributable to his Developer role.
As far as I know, this is a somewhat standard Kimball implementation.
The Problem:
This implementation does not handle the correction of data very well. Let's say that, in our previous example, HR tells us that Bob transferred to a role of Analyst for a valid date range of 5/1/2015 through 6/1/2015 and that they failed to enter it into the system. This presents us with a major problem: we need to split the row where employee_key = 1
into two rows with different valid/invalid dates. Furthermore, we need to find all places that now erroneously reference employee_key = 1
and update them. Here are the issues:
- We need to run an expensive update operation over a number of enormous tables. We can't afford to do this every time a correction needs to be made.
- The dimension row split needs to be done manually, putting the table at risk of data entry error or valid/invalid date range overlap.
- Splitting a row violates an important rule: that primary keys are immutable and never change once assigned.
The Solution:
I can think of a number of ways of solving this, but none are elegant:
- Just bear with the nightmare of updating the surrogate key data. Maybe force corrections to occur on a normal schedule, reducing the number of times we need to run this update.
- Transform the employee dimension table into a row-per-employee-per-day table. This has the benefit of allowing a natural key join on
employee_id
andcalendar_dt
. It also makes this key immutable and allows the appropriate surrogate key value to be discerned without looking it up in the dimension table. Regardless of how the dimension table changes, the fact tables will always refer to the correct rows. This has the major drawback of transforming our 100,000 row dimension table into a 20M row one.
What other solutions are there? I can't be the only one that's run into this problem... Help me!
Caveats:
- We make the assumption that data will never require a time element (that the granularity will always be at the day level).
- We make the assumption that
employee_id
values will never change (yes, I know this is a dangerous assumption).