SQL Datawarehousing, need help populating my DIMENSION using TSQL SELECT or a better alternative?

Question

I have a table in SQL Server where I "stage" the data warehouse extract from our ERP system.

From this staging table (table name: DBO.DWUSD_LIVE), I build my dimensions and load my fact data.

An example DIMENSION table is called "SHIPTO". This dimension has the following columns:

"shipto_id"
"shipto"
"salpha"
"ssalpha"
"shipto address"
"shipto name"
"shipto city"

Right now I have an SSIS package that runs a SELECT DISTINCT across the above columns to retrieve the "unique" rows, and the package then assigns the "shipto_id" surrogate key to each row.

An example of my current TSQL Query is:

SELECT DISTINCT
"shipto", "salpha", "ssalpha", "shipto address", "shipto name", "shipto city"
FROM DBO.DWUSD_LIVE

This works great but is not "speedy". Some dimensions have 10 columns, and doing a SELECT DISTINCT across all of them is not ideal.

In this dimension, my "Business Key" columns are "SHIPTO", "SALPHA", and "SSALPHA".

So if I do:

SELECT DISTINCT
"shipto", "salpha", "ssalpha"
FROM DBO.DWUSD_LIVE

It yields the same number of rows as:

SELECT DISTINCT
"shipto", "salpha", "ssalpha", "shipto address", "shipto name", "shipto city"
FROM DBO.DWUSD_LIVE

Is there a better way to do this TSQL QUERY? I need all the columns, but only DISTINCT on the business key columns.

Your help is appreciated.

Below is an image of how my project is set up in SSIS; the dimension is an SCD Type 1.

If anyone knows of a better way for me to populate my dimension from the staging table based on business keys and surrogate keys, please let me know. — user1709091

Pondlife · Accepted Answer · 2012-10-01T16:45:35

I would start by splitting this into two operations: generating the surrogate key and populating the dimension table. The first step will then be a DISTINCT on only 3 columns, and the second step will become a JOIN. Indexing the columns used in both operations might then give you some improvement.
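As a rough sketch of what the mapping table and supporting index could look like: the column types, constraint name, and index name below are assumptions for illustration, and dbo.Source stands in for the staging table as in the queries that follow.

```sql
-- Hypothetical DDL: data types and object names are assumptions, not taken from the question.
CREATE TABLE dbo.KeyMappingTable (
    shipto_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key, generated on insert
    shipto    VARCHAR(50) NOT NULL,                     -- business key, part 1
    salpha    VARCHAR(50) NOT NULL,                     -- business key, part 2
    ssalpha   VARCHAR(50) NOT NULL,                     -- business key, part 3
    -- The unique constraint also creates an index on the business key columns.
    CONSTRAINT UQ_KeyMappingTable_BusinessKey UNIQUE (shipto, salpha, ssalpha)
);

-- Index the business key columns on the staging side, so the DISTINCT
-- and the later JOIN can read a narrow index instead of the full table.
CREATE NONCLUSTERED INDEX IX_Source_BusinessKey
    ON dbo.Source (shipto, salpha, ssalpha);
```

Whether the staging index pays off depends on how the staging table is loaded; if it is truncated and bulk-loaded on every run, it may be cheaper to create the index after the load.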

You can combine the DISTINCT with NOT EXISTS to avoid processing rows that have already been mapped, something like this:

-- Add only the business keys that are not mapped yet.
-- Assumes shipto_id is an IDENTITY column, so it is generated on insert.
insert into dbo.KeyMappingTable (shipto, salpha, ssalpha)
select distinct s.shipto, s.salpha, s.ssalpha
from dbo.Source s
where not exists (
    select *
    from dbo.KeyMappingTable m
    where m.shipto = s.shipto
      and m.salpha = s.salpha
      and m.ssalpha = s.ssalpha
)

Then you have the mapping, so you can do this:

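-- Insert dimension rows for keys that are mapped but not yet in the dimension.
-- If dbo.Source can hold several rows per business key, deduplicate it here
-- (for example with DISTINCT or ROW_NUMBER), or the join will insert duplicates.
insert into dbo.DimShipTo (shipto_id, shipto /*, etc. */)
select
    m.shipto_id,
    s.shipto -- etc.
from
    dbo.KeyMappingTable m
    join dbo.Source s
    on m.shipto = s.shipto and m.salpha = s.salpha and m.ssalpha = s.ssalpha
where
    not exists (
        select *
        from dbo.DimShipTo d
        where d.shipto_id = m.shipto_id
    )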

You should also look at MERGE, which is convenient if you're using a Type 1 dimension and just want to update addresses or other attributes when they change (and it's a useful command in general). But it's only available from SQL Server 2008; you didn't mention what version or edition of SQL Server you're using.
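A minimal sketch of what a Type 1 MERGE might look like for this dimension, using the column names from the question. It assumes dbo.DimShipTo.shipto_id is an IDENTITY column (so no separate mapping table is needed in this variant) and that the business key is unique in the dimension. The ROW_NUMBER step is an addition for illustration: MERGE raises an error if more than one source row matches the same target row, so the staging rows are reduced to one per business key first.

```sql
-- Sketch only: assumes shipto_id is an IDENTITY column on dbo.DimShipTo
-- and that (shipto, salpha, ssalpha) is unique in the dimension.
WITH ranked AS (
    SELECT
        shipto, salpha, ssalpha,
        [shipto address], [shipto name], [shipto city],
        ROW_NUMBER() OVER (
            PARTITION BY shipto, salpha, ssalpha
            ORDER BY (SELECT NULL)          -- arbitrary pick among duplicates
        ) AS rn
    FROM dbo.DWUSD_LIVE
),
src AS (
    -- Exactly one row per business key.
    SELECT shipto, salpha, ssalpha,
           [shipto address], [shipto name], [shipto city]
    FROM ranked
    WHERE rn = 1
)
MERGE dbo.DimShipTo AS d
USING src AS s
    ON  d.shipto  = s.shipto
    AND d.salpha  = s.salpha
    AND d.ssalpha = s.ssalpha
-- Type 1: overwrite attributes in place when any of them changed.
-- Note: <> does not detect changes to or from NULL.
WHEN MATCHED AND (
       d.[shipto address] <> s.[shipto address]
    OR d.[shipto name]    <> s.[shipto name]
    OR d.[shipto city]    <> s.[shipto city]
) THEN
    UPDATE SET
        d.[shipto address] = s.[shipto address],
        d.[shipto name]    = s.[shipto name],
        d.[shipto city]    = s.[shipto city]
-- New business key: insert it and let IDENTITY assign shipto_id.
WHEN NOT MATCHED BY TARGET THEN
    INSERT (shipto, salpha, ssalpha,
            [shipto address], [shipto name], [shipto city])
    VALUES (s.shipto, s.salpha, s.ssalpha,
            s.[shipto address], s.[shipto name], s.[shipto city]);
```

If the attribute columns are nullable, the change test would need explicit NULL handling, and if the staging rows for one business key can disagree on the address or name, the ORDER BY in ROW_NUMBER should pick the intended row (for example the most recent one) rather than an arbitrary one.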
