1 vote

SQL Server 2012: using an SSIS package, how to validate source records for duplicates before inserting?

Our source file is a .csv. We are seeing duplicate records loaded into the staging table.

At present, we follow a manual process for loading the data.

How can we validate the source file data against the destination table before loading, and load only the valid records? Duplicates can arise not only because the source file itself contains duplicate records, but also because the same file may be reloaded into the staging table.

We do not truncate the staging table; we keep the existing records as they are.

Second question: how can we pick up the name of the source file and pass it into the load? Possibly by having a derived column, "FileName", that gets loaded along with the raw data into the staging table.

Do you want to just ignore duplicates, or do you want to send them somewhere? - FrankPl
Are you talking about duplicates in the source or are you saying that when you rerun the process, records are added again? - Nick.McDermaid
This specific scenario is related to reloading the same file more than once, not to duplicate records in the source. I haven't come across duplicate records in the source. - goofyui
I also need to fetch the name of the source file. - goofyui
Please edit your question and add "I also need to fetch the name of the source file". It would also help to define the version of SQL Server in the tags. The normal load pattern is: 1. TRUNCATE the staging table; 2. Load all data from the CSV into the staging table; 3. Merge the staging table into the final table. Do you have two tables in this process or only one? The problem you have has many solutions, but there isn't enough description of your existing process. - Nick.McDermaid

3 Answers

2 votes

The typical load pattern I use in this case is:

  1. Prepare a staging table that matches the source file (steps 1 and 2 are sketched after this list)
  2. In SSIS, run a SQL Task with TRUNCATE TABLE StagingTable; (which clears it out)
  3. Then run a Data Flow Task that loads the entire data file into the staging table
  4. Lastly, merge the staging table into the final table.
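
A minimal sketch of steps 1 and 2 in T-SQL (the table and column names here are assumptions based on the question; match them to your actual file layout):

-- Staging table mirroring the CSV layout (hypothetical columns)
CREATE TABLE dbo.StagingTable (
    PrimaryKey varchar(50)  NOT NULL,
    Column1    varchar(100) NULL,
    Column2    varchar(100) NULL,
    Column3    varchar(100) NULL,
    FileName   varchar(260) NULL   -- optional: source file name, per the question's second requirement
);

-- Run this in an Execute SQL Task before each load
TRUNCATE TABLE dbo.StagingTable;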

I prefer to do this last step in a SQL Task also:

INSERT INTO FinalTable
    (PrimaryKey, Column1, Column2, Column3)
SELECT
    PrimaryKey, Column1, Column2, Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey = SRC.PrimaryKey
);

If you prefer a graphical UI, and you don't mind the extra network traffic and slower processing time, you can do the same type of merge operation using Lookup transformations. You can even use the SCD component, but I strongly discourage its use.

Whether you do it in T-SQL or in the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have such a key, there is no way to deduplicate.

Note that in this example you have a 'real' staging table whose only purpose is to get the data file into the database, and then a final table that contains the final, consistent result.

Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file.
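
If you do need updates as well, one option (not part of the pattern above) is a T-SQL MERGE, which SQL Server 2012 supports. A minimal sketch using the same assumed column names:

MERGE FinalTable AS TGT
USING StagingTable AS SRC
    ON TGT.PrimaryKey = SRC.PrimaryKey
WHEN MATCHED THEN
    UPDATE SET Column1 = SRC.Column1,
               Column2 = SRC.Column2,
               Column3 = SRC.Column3
WHEN NOT MATCHED BY TARGET THEN
    INSERT (PrimaryKey, Column1, Column2, Column3)
    VALUES (SRC.PrimaryKey, SRC.Column1, SRC.Column2, SRC.Column3);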

1 vote
  1. Given your exact scenario (loading the same file again), I would first check whether the file has already been loaded to the staging table. If you do that, you don't have to worry about checking for duplicates at the record level.

  2. How are you setting the connection to the file? In most of the data loads I have dealt with, I designed a Foreach Loop Container where the file name/path is populated into a user variable. As you said, you could then use a Derived Column transform to add a new column that gets its value from that variable. If you don't have the file name in a user variable, you can use an Expression Task in the control flow to populate it.

To cover your exact requirement, I would use the above step to populate the file name in the table. You could even normalize it into a separate table instead of storing a long file name for every data record. Once you have the file names in the database, you can run an Execute SQL Task at the beginning of the package to check whether that file has already been loaded.
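
A minimal sketch of that check, assuming the staging table carries the FileName column from the question (the names are assumptions; with an OLE DB connection the ? placeholder maps to an SSIS parameter):

-- Bind the ? parameter to the user variable holding the file name, and
-- capture AlreadyLoaded into another variable to drive a precedence constraint.
SELECT CASE WHEN EXISTS (
    SELECT 1 FROM dbo.StagingTable WHERE FileName = ?
) THEN 1 ELSE 0 END AS AlreadyLoaded;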

1 vote

Two years back I faced the same problem importing TSV files. I tried many other solutions, but the best I could design was a C# code script for this validation.

What I did as a solution:

  • Create a C# DataTable object in memory with a primary key constraint, like:

// dtFilterdPK is the in-memory DataTable the CSV rows are added to
DataColumn[] keyColumns = new DataColumn[1];
keyColumns[0] = dtFilterdPK.Columns["Column name"];
dtFilterdPK.PrimaryKey = keyColumns;   // rows duplicating this key are now rejected

  • Then try to add the rows from your CSV to this DataTable one by one.
  • Whenever a row duplicates the primary key, DataTable.Rows.Add throws a ConstraintException.
  • Handle this in a try..catch block and log the duplication error as your logging requirements dictate.
  • Skip those error records so they are not imported into the DataTable object.
  • At last, bulk-import the deduplicated DataTable into your table, like:

using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
    bulkCopy.DestinationTableName = "Your DB Table Name";   // assign destination table name
    bulkCopy.WriteToServer(dtToBeImport);                   // write into the actual table
}

Hope this will help you.