1 vote

SQL Server 2012: using an SSIS package, how to validate source records for duplicates before inserting?

Our source file is a .csv. We are seeing duplicate records loaded into the staging table.

At present, we follow a manual process for loading the data.

How can we validate the source file data against the destination table before loading, and load only the valid records? Duplicates can arise not only because the source file itself contains duplicate records, but also because the same file may be reloaded into the staging table.

We do not truncate the staging table; we keep the existing records as they are.

Second question: how can we pick up the name of the source file and pass it into the load? Possibly by having a derived column, "FileName", that gets loaded along with the raw data into the staging table.

Do you want to just ignore duplicates, or do you want to send them somewhere? - FrankPl
Are you talking about duplicates in the source or are you saying that when you rerun the process, records are added again? - Nick.McDermaid
This specific scenario is related to reloading the same file more than once, not to duplicate records in the source. I haven't come across duplicate records in the source. - goofyui
I also need to fetch the name of the source file. - goofyui
Please edit your question and add "I also need to fetch the name of the source file". It would also help to define the version of SQL Server in the tags. The normal load pattern is: 1. TRUNCATE the staging table; 2. Load all data from the CSV into the staging table; 3. Merge the staging table into the final table. Do you have two tables in this process or only one? The problem you have has many solutions, but there isn't enough description of your existing process. - Nick.McDermaid

3 Answers

2 votes

The typical load pattern I use in this case is:

  1. Prepare a staging table that matches the source file (steps 1 and 2 are sketched after this list)
  2. In SSIS, run a SQL Task with TRUNCATE TABLE StagingTable; (which clears it out)
  3. Then run a Data Flow Task that loads the entire data file into the staging table
  4. Lastly, merge the staging table into the final table.
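
A minimal sketch of steps 1 and 2 in T-SQL (the table and column names here are assumptions based on the question; match them to your actual file layout):

-- Staging table mirroring the CSV layout (hypothetical columns)
CREATE TABLE dbo.StagingTable (
    PrimaryKey varchar(50)  NOT NULL,
    Column1    varchar(100) NULL,
    Column2    varchar(100) NULL,
    Column3    varchar(100) NULL,
    FileName   varchar(260) NULL   -- optional: source file name, per the question's second requirement
);

-- Run this in an Execute SQL Task before each load
TRUNCATE TABLE dbo.StagingTable;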

I prefer to do this last step in a SQL Task also:

INSERT INTO FinalTable
    (PrimaryKey, Column1, Column2, Column3)
SELECT
    PrimaryKey, Column1, Column2, Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey = SRC.PrimaryKey
);

If you prefer a graphical UI, and you don't mind the extra network traffic and slower processing time, you can do the same type of merge operation using Lookup transformations. You can even use the SCD component, but I strongly discourage its use.

Whether you do it in T-SQL or in the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have such a key, there is no way to deduplicate.

Note that in this example you have a 'real' staging table whose only purpose is to get the data file into the database, and then a final table that contains the final, consistent result.

Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file.
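
If you do need updates as well, one option (not part of the pattern above) is a T-SQL MERGE, which SQL Server 2012 supports. A minimal sketch using the same assumed column names:

MERGE FinalTable AS TGT
USING StagingTable AS SRC
    ON TGT.PrimaryKey = SRC.PrimaryKey
WHEN MATCHED THEN
    UPDATE SET Column1 = SRC.Column1,
               Column2 = SRC.Column2,
               Column3 = SRC.Column3
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PrimaryKey, Column1, Column2, Column3)
    VALUES (SRC.PrimaryKey, SRC.Column1, SRC.Column2, SRC.Column3);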

1 vote
  1. Given your exact scenario (loading the same file again), I would first check whether the file has already been loaded to the staging table. If you do that, you don't have to worry about checking for duplicates at the record level.

  2. How are you setting the connection to the file? In most of the data loads I have dealt with, I designed a Foreach Loop Container where the file name/path is populated into a user variable. As you said, you could then use a Derived Column transform to add a new column that gets its value from that variable. If you don't have the file name in a user variable, you can use an Expression Task in the control flow to populate it.

To cover your exact requirement, I would use the above step to populate the file name in the table. You could even normalize it into a separate table instead of storing a long file name for every data record. Once you have the file names in the database, you can run an Execute SQL Task at the beginning of the package to check whether that file has already been loaded.
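
A minimal sketch of that check, assuming the staging table carries the FileName column from the question (the names are assumptions; with an OLE DB connection the ? placeholder maps to an SSIS parameter):

-- Bind the ? parameter to the user variable holding the file name, and
-- capture AlreadyLoaded into another variable to drive a precedence constraint.
SELECT CASE WHEN EXISTS (
    SELECT 1 FROM dbo.StagingTable WHERE FileName = ?
) THEN 1 ELSE 0 END AS AlreadyLoaded;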

1 vote

Two years back I faced the same problem importing TSV files. I tried many other solutions, but the best I could design was a C# code script for this validation.

What I did as a solution:

  • Create a C# DataTable object in memory with a primary key constraint, like:

// dtFilterdPK is the in-memory DataTable the CSV rows are added to
DataColumn[] keyColumns = new DataColumn[1];
keyColumns[0] = dtFilterdPK.Columns["Column name"];
dtFilterdPK.PrimaryKey = keyColumns;   // rows duplicating this key are now rejected

  • Then try to add the rows from your CSV to this DataTable one by one.
  • Whenever a row duplicates the primary key, DataTable.Rows.Add throws a ConstraintException.
  • Handle this in a try..catch block and log the duplication error as your logging requirements dictate.
  • Skip those error records so they are not imported into the DataTable object.
  • At last, bulk-import the deduplicated DataTable into your table, like:

using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
    bulkCopy.DestinationTableName = "Your DB Table Name";   // assign destination table name
    bulkCopy.WriteToServer(dtToBeImport);                   // write into the actual table
}

Hope this will help you.