0
votes

I am creating an SSIS package that loads data from a source system into a Data Warehouse using lookups. Here is how the process is currently set up:

  1. Load all relevant tables to staging tables (6 lookup tables and the data for the fact table)
  2. For each lookup table use a Lookup component to determine if the record already exists in the data warehouse dimension table. If it does match get the ID from the dimension, if it does not match derive a match.
  3. Load into the fact table the IDs to the dimensions and two or three other pieces of data that flowed through from the source.

Not very complex. The lookup tables have anywhere from 10 - 50 records. I have limited my source data to 10,000 records in order for the package to complete in a reasonable amount of time. Takes 20 minutes to load 10,000 records.

The time line in the messages shows:

  • 11:13:28 - Package Started
  • 11:13:36 - All the lookups were prepared and ready
  • 11:32:33 - The final commit for the data insertion in 'Target - Fact Insert new record has ended'
  • 11:32:34 - Package Ended

Where can I look to see why it took almost 19 minutes to insert 10,000 records? Any suggestions on what I can do differently to make this run faster?

Edit: Responses to questions below

Data Types - Not sure what you are asking? The data types of what I am looking up? integers are the ID fields. The Code field are either integers or Char(4).

Data Flow of main process is: Source (run query to get info from stage table), Lookup a LocationID, match and unmatched to Union, Lookup an Injury CodeID, match and unmatched to Union, repeat for the next 4 lookup tables, sort the source, merge join my target (currently an empty table), determine Add, Change or Delete record (these are all Adds).

The target table is:

CREATE TABLE dbo.tfResidentEvent
    (
      ResidentEventID INT IDENTITY(1, 1)
                           PRIMARY KEY ,
      SourceSystemID INT ,
      EventID INTEGER ,
      EventCodeID INT ,
      FacilityID INT  ,
      ResidentID INT  ,
      DateID INT  ,
      TimeID INT  ,
      LocationCodeID INT  ,
      CauseCodeID INT  ,
      ConditionCodeID INT ,
      InjuryCodeID INT ,
      chksum_type1 int,
      chksum_type2 int
    )

The timings are coming from the Messages shown on the server for the package execution.

SSIS on SQL Server 2012

I will try to re-do the package with derived columns and see how long that takes to execute.

1
What are your data types? What does your Data Flow Look like? Ar there triggers on the target table? What is the definition of the target table? If you replace the destination with a Derived Column and run the package, how long does that take? Did you collect these timings running from the command line/SQL Agent or was this executed in the context of the Visual Studio/BIDS/SSDT? What version of SSIS are you using? - billinkc
What do you use for finding out if the records are create/update/delete? from your post this part seems to be inside the 19 minute window and this type of operation is notoriously slow in SSIS. Also, what component does the insert and how is it configured? Finally, did you turn of your indexes prior to inserting and rebuild them later? (another "known" speedbump) - Tristan
If I query tfResidentEvent during the "insert" I get no records returned until after the package finishes. I am using an ODE DB Destination to do the insert using the Table or View - fast load option. Currently there are no indexes on the table except for the PK. - Leslie

1 Answers

0
votes

I removed all the LOOKUP functions from the SSIS package. In my source I used a single query using LEFT OUTER JOINS in order to get FK_IDs or NULL in the appropriate fields. I sorted both my source and target in the query and then used the MERGE JOIN to identify whether the package should Add/Update/Delete.

Once deployed on the server the insert of 10,000 records went from 19+ minutes to 17 seconds. I was able to load the entire table of 650K records in 10 minutes.