0
votes

When keying a FACT table in a data warehouse, is it better best to use the primary key from the foreign table or the unique key or identifier used by the business?

For example (see below illustration), assume you have two dimension tables "DimStores" and "DimCustomers" and one FACT table named "FactSales". Both of the dimension tables have an indexed primary key field that is an integer data type and is named "ID". They also have an indexed unique business key field that is a alpha-numeric text data type named "Number".

Typically you'd use the primary key of dimension tables as the foreign keys in the FACT table. However, I'm wondering if that is the best approach.

By using the primary key, in order to look up or do calculations on the facts in the FACT table, you'd likely have to always do a join query on the primary key and use the business key as your look up. The reason is because most users won't know the primary key value to do a lookup in the FACT table. They will, however, likely know the business key. Therefore to use that business key you'd have to do a join query to make the relationship.

Since the business key is indexed anyway, would it be better to just use that as the foreign key in the FACT table? That way you wouldn't have to do a join and just do your lookup or calculations directly?

I guess it boils down to whether join queries are that expensive? Imagine you're dealing with a billion record FACT table and dimensions with tens of millions of records.

Example tables:

DimStores:
+------------+-------------+-------------+
| StoreId    | StoreNumber | StoreName   |
+------------+-------------+-------------+
| 1          | S001        | Los Angeles |
| 2          | S002        | New York    |
+------------+-------------+-------------+

DimCustomers:
+------------+----------------+--------------+
| CustomerId | CustomerNumber | CustomerName |
+------------+----------------+--------------+
| 1          | S001           | Michael      |
| 2          | S002           | Kareem       |
| 3          | S003           | Larry        |
| 4          | S004           | Erving       |
+------------+----------------+--------------+

FactSales:
+---------+------------+------------+
| StoreId | CustomerId | SaleAmount |
+---------+------------+------------+
| 1       | 1          |      $400  |
| 1       | 2          |      $300  |
| 2       | 3          |      $200  |
| 2       | 4          |      $100  |
+---------+------------+------------+

In the above to get the total sales for the Los Angles store I'd have to do this:

Select Sum(SaleAmount) 
From FactSales FT 
Inner Join DimStores D1 ON FT.StoreId = D1.StoreId
Where D1.StoreNumber = 'S001'

Had I used the "StoreNumber" and "CustomerNumber" fields as the foreign keys instead in the "FactSales" table. I wouldn't have had to do a join query and could have directly done this instead:

Select Sum(SaleAmount) 
From FactSales
Where StoreNumber = 'S001'
1

1 Answers

1
votes

The reason you use artificial primary keys is to isolate the data warehouse from business decisions.

Your business grows. Now you have more than 1000 stores. The keys for the stores change. How do you handle this?

If the store key is spread throughout your data warehouse, this is a painful operation. If the store key is just an attribute on a dimension table, then this is easy.

I should also note that in many cases, the dimensions might be type 2 dimensions -- meaning that they change over time. For instance, customers can change their names, but you might want to know what their name was at a particular point in time.

And a third reason. Artificial primary keys are usually integers. These are better for indexing than strings (particularly strings with variable lengths). The difference in performance is minor, but it is a reason to use the primary keys. In fact, if the keys are strings and are longer than integers, it might be more efficient to use the artificial keys in terms of space.