I would recommend loading a staging table on the destination server and then merge the results into a target table on the destination server. If you need to run any hygiene rules, then you could do this via stored procedure since you are bound to get better performance than through SSIS data flow transformation tasks. Besides, deduping is generally a multi-step process. You may want to dedupe on:
- Distinct lines.
- Distinct groups of columns like First Name, Last Name, Email Address, etc.
- You may want to dedupe against an existing target table. If that's the case, then you may need to include NOT EXISTS or NOT IN statements. Or you may want to update the original row with new values. This usually is best served with a MERGE statement and a subquery for the source.
- Take the first or last row of a particular pattern. For instance, you may want the last row entered in the file for each occurrence of an email address or phone number. I usually rely on CTE's with ROW_NUMBER() to generate sequential order and reverse order columns like in the folling sample:
.
WITH
sample_records
( email_address
, entry_date
, row_identifier
)
AS
(
SELECT '[email protected]'
, '2009-10-08 10:00:00'
, 1
UNION ALL
SELECT '[email protected]'
, '2009-10-08 10:00:01'
, 2
UNION ALL
SELECT '[email protected]'
, '2009-10-08 10:00:02'
, 3
UNION ALL
SELECT '[email protected]'
, '2009-10-08 10:00:00'
, 4
UNION ALL
SELECT '[email protected]'
, '2009-10-08 10:00:00'
, 5
)
, filter_records
( email_address
, entry_date
, row_identifier
, sequential_order
, reverse_order
)
AS
(
SELECT email_address
, entry_date
, row_identifier
, 'sequential_order' = ROW_NUMBER() OVER (
PARTITION BY email_address
ORDER BY row_identifier ASC)
, 'reverse_order' = ROW_NUMBER() OVER (
PARTITION BY email_address
ORDER BY row_identifier DESC)
FROM sample_records
)
SELECT email_address
, entry_date
, row_identifier
FROM filter_records
WHERE reverse_order = 1
ORDER BY email_address;
There are lots of options for you on deduping files, but ultimately I recommend handling this in a stored procedure once you have loaded a staging table on the destination server. After you cleanse the data, then you can either MERGE or INSERT into your final destination.