I am pretty new to SSIS and we've got a client who's got pretty large data sets to deal with.
One table has over 190 million rows. Every month they replace this data with data from a CSV file (about 40GB in size). This was done as a step in an SSIS package.
Now they only want to replace data after a certain date (the smallest date in the CSV file), so instead of sending data all the way from 1997 they only send data from 2010 and onwards. The way we approached this was:
- Truncate "temp_table" (SQL Statement Task)
- Read all data from "data.csv" (Flat File Source)
- Multicast
- Insert all data to "temp_table" (OLE DB Destination)
- This is done in batches of 75k rows per batch
- Set a variable "minDate" to the smallest date value in "data.csv" (Script Component)
- Insert all data to "temp_table" (OLE DB Destination)
- Multicast
- Delete all rows in "destination_table" where the date value is larger than "minDate" (SQL Statement Task)
- This is done in batches of 100k rows per batch
- Insert all rows from "temp_table" to "destination_table" (OLE DB Source -> OLE DB Destination)
- This is done in batches of 75k rows per batch
Using this approach we are inserting data to "temp_table" at an average rate of 60k rows/second.
The deletion phase in "destination_table" deletes ~40k rows/second, and the second insert phase (from "temp_table" to "destination_table") operates at a speed of ~80k rows/second.
Is this speed average for this type of job? Is it too slow? What would you do to improve it?
TL;DR
We are inserting data to an MSSQL database table using SSIS at a rate of ~4-5 million rows per minute and deleting data at a rate of ~2.5 million rows per minute. Is that acceptable?
Is it too slow?What is the SLA for the data refresh process? In your loading of your temp table, I can think of two optimizations I would try. The first is to replace the script task with the native Aggregation task. But that's only if you feel you must capture that value in there. My preference would be to load all the data to the table. I would then add an Execute SQL Task to find the minimum insert date. Test, see which is faster. If temp table and destination are on the instance, skip all of that and just write the SQL to delete and insert and put that in an Execute SQL task. - billinkc