3
votes

I am building a staging area that gets data from informatica cdc. Now for example lets say I am replication two tables for incremental load. I have to delete the processed data from the staging tables after each load. I join these two tables to populate my target dimension. Problem is change can happen on only one source and not the other in a particular load.

Example:

Employee
---------
ID   NAME   
1    PETER  

EmployeeSal
------------
EMPID  SAL

1      2000

If the above is replicated in my first load, I join the two tables and load them thats fine.

Now lets say the salary of peter is updated frrom 2000 to 3000. As I have delete my staging tables after each load, I will have the following for current load.

Employee
---------
ID   NAME   

EmployeeSal
-----------
EMPID  SAL

1     3000

Here is my problem ss I have to populate the whole row of the dimension which is TYPE2.

  1. I have to join back to the source to get the other attributes of employee table ( This is just a lame example, In reality it might be 10 tables and hundreds of thousand of changes). Is it recommended to go back to the source?

  2. I can join the target table to this mix and populate the missing attributes.
    Is this even recommended as I have to do lot of case statements , nullhandlings etc if a particular staging table has no change for a dimension record. My question is is it even common that target table is joined in a ETL transformation?

2
Thanks should never be part of a good question. Indent preformatted blocks of text by selecting them and then Ctrl+K - Anthon

2 Answers

0
votes

Going back to the source system, invalidates the purpose of creating the staging area in the first place. It is usually not recommended.

However querying the target table is quite common to get previous information. But it is true that you have to do a lot of checks.

Another option is to maintain a scd type 1 in your staging area. Maintain insert and update timestamps in staging which you can use to get only the changed records while loading the dimension.

0
votes

I have encountered the same problem with Order Header and Details - if the detail changes I need to be able to inner join to the Header to update my flattened Order Fact table.

I resolved it this way: After I have staged all the changed records, I look up any missing Headers for the changed Details (an vice versa) using an SQL Task that loads an object with an array of Order IDs. And I use a for each loop to get load the missing Headers into staging.

My inner joins from Staging to the data warehouse will now work as expected