1
votes

I'm designing a data warehouse using dimensional modeling. I've read most of the Data Warehouse Toolkit by Kimbal & Ross. My question is regarding the columns in a dimensional table that hold dates. For example, here is a table for Users of the application:

CREATE TABLE user_dim (
   user_key BIGINT,  -- surrogate key
   user_id BIGINT,   -- natural key
   user_name VARCHAR(100),
   ...
   user_added_date DATE, -- type 0, date user added to the system
   ...
   -- Type-2 SCD administrative columns
   row_start_date DATE, -- first effective date for this row
   row_end_date DATE,   -- last effective date for this row, 9999-12-31 if current
   row_current_flag VARCHAR(10), -- current or expired
)

The last three attributes are for implementing type 2 slowly-changing dimensions. See Kimbal page 150-151.

Question 1: Is there are best practice for the data type of the row_start_date and row_end_date columns? The type could be DATE (as shown), STRING/VARCHAR/CHAR ("YYYY-MM-DD"), or even BIGINT (foreign key to Date Dimension). I don't think there would be much filtering on the row start/end dates, so a key to the Date Dimension is not required.

Question 2: Is there a best practice for the data type of dimension attributes such "user_added_date"? I can see someone wanting reports on users added per fiscal quarter, so using a foreign key to Date Dimension would be helpful. Any downsides to this, besides having to join from User Dimension to Date Dimension for display of the attribute?

If it matters, I'm using Amazon Redshift.

2

2 Answers

1
votes

Question 1 : For the SCD from and to dates I suggest you use timestamp. My preference is WITHOUT time zone and ensure all of your timestamps are UTC

Question 2 : I always set up a date dimension table with a logical key of the actual date. that way you can join any date (e.g. the start date of the user) to the date dimension to find the eg "fiscal month" or whatever off the date dimension. But also you can see the date without joining to the date dimension as its plain to see (stored as a date)

With redshift (or any columnar MPP DBMS) it is good practice to denormalise a little. e.g. use star schema rather than snowflake schema. This is because of the efficiencies that columnar brings, and deals with the inneficient joins (because there are no indexes)

1
votes

For Question 1: row_start_date and row_end_date are not part of the incoming data. As you mentioned they are created artifially for SCD Type 2 purposes, so they should not have a key to Date dimension. User dim has no reason to have a key to Date dimension. For data type YYYY-MM-DD should be fine.

For Question 2: If you have a requirement like this I would suggest creating a derived fact table (often called accumulating snapshot fact table) to keep derived measures like user_added_date

For more info see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/accumulating-snapshot-fact-table/