3 votes

I'm fairly new to SSIS, and we have a client with pretty large data sets to deal with.

One table has over 190 million rows. Every month they replace this data with data from a CSV file (about 40GB in size). This was done as a step in an SSIS package.

Now they only want to replace data after a certain date (the smallest date in the CSV file), so instead of sending data all the way from 1997 they only send data from 2010 and onwards. The way we approached this was:

  1. Truncate "temp_table" (SQL Statement Task)
  2. Read all data from "data.csv" (Flat File Source)
    • Multicast
      1. Insert all data to "temp_table" (OLE DB Destination)
        • This is done in batches of 75k rows per batch
      2. Set a variable "minDate" to the smallest date value in "data.csv" (Script Component)
  3. Delete all rows in "destination_table" where the date value is larger than "minDate" (SQL Statement Task)
    • This is done in batches of 100k rows per batch
  4. Insert all rows from "temp_table" to "destination_table" (OLE DB Source -> OLE DB Destination)
    • This is done in batches of 75k rows per batch
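As a rough T-SQL sketch, the delete-and-reload steps (3 and 4) above boil down to something like the following. The table and column names (`destination_table`, `temp_table`, `date_col`) are placeholders for the actual schema, and the batch sizes mirror the ones in the package:

```sql
-- Step 3: delete rows dated after the smallest CSV date, in batches
-- of 100k rows to keep each transaction (and the log) small.
-- @minDate is the value captured from data.csv by the Script Component.
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (100000) FROM destination_table
    WHERE date_col > @minDate;
    SET @rows = @@ROWCOUNT;
END

-- Step 4: reload from the staging table. A TABLOCK hint can enable
-- minimally logged inserts under the simple or bulk-logged recovery model.
INSERT INTO destination_table WITH (TABLOCK)
SELECT *
FROM temp_table;
```

Whether the `TABLOCK` path is actually minimally logged depends on the recovery model, indexes, and whether the destination is empty, so it is worth verifying against the real table.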

Using this approach we are inserting data to "temp_table" at an average rate of 60k rows/second.

The deletion phase in "destination_table" deletes ~40k rows/second, and the second insert phase (from "temp_table" to "destination_table") operates at a speed of ~80k rows/second.

Is this speed average for this type of job? Is it too slow? What would you do to improve it?

TL;DR

We are inserting data to an MSSQL database table using SSIS at a rate of ~4-5 million rows per minute and deleting data at a rate of ~2.5 million rows per minute. Is that acceptable?

How about [CSV] ==(all rows)=> [Conditional Split (by date)] ==(matching rows)=> [OLEDB Destination]? - Tomalak
@Tomalak Thanks for the input! The problem is that the rows will not be matching in any way. It is "unknown" data with no identifiers whatsoever and 50 columns that can vary from the last point of insertion. There might be 5000 rows that have the column "referee" set to "abc123" in the data set from last month, and only 300 in the set from this month... The wanted result then is that the final table should only have 300 rows with "referee" set to "abc123". - Emil Larsson
Is it too slow? What is the SLA for the data refresh process? In the loading of your temp table, I can think of two optimizations I would try. The first is to replace the script task with the native Aggregate transformation. But that's only if you feel you must capture that value in there. My preference would be to load all the data to the table, and then add an Execute SQL Task to find the minimum insert date. Test, and see which is faster. If the temp table and destination are on the same instance, skip all of that: just write the SQL to delete and insert and put it in an Execute SQL Task. - billinkc
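billinkc's Execute SQL Task suggestion amounts to a one-line query run after `temp_table` is fully loaded; `date_col` is a placeholder for the actual date column:

```sql
-- Single-row result set: map this to the SSIS variable "minDate"
-- on the Execute SQL Task's Result Set tab.
SELECT MIN(date_col) AS minDate
FROM temp_table;
```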
if you want better tuning thoughts, show us the source file definition, temp and final destination table structure, the flat file connection manager's column definition. A sample of the data wouldn't hurt, there are some tricks we might be able to do there. When the package is running, what is it waiting on (disk, memory, cpu, network)? What version of SQL Server are you working with? Standard or Enterprise edition? - billinkc
If you are deleting all rows from a specific date onward, consider using partitions. Truncating a partition is as fast as truncating a table. It is an Enterprise edition feature. - Ako
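A minimal sketch of Ako's partitioning suggestion, assuming `destination_table` is partitioned on the date column and that the partitions to be emptied are numbers 5 through 8 (hypothetical values; check `sys.partitions` for the real ones):

```sql
-- SQL Server 2016 and later: truncate individual partitions directly.
TRUNCATE TABLE destination_table WITH (PARTITIONS (5 TO 8));

-- Older versions (Enterprise edition): switch the partition out to an
-- empty staging table with identical structure, then truncate that.
ALTER TABLE destination_table SWITCH PARTITION 5 TO destination_staging;
TRUNCATE TABLE destination_staging;
```

Either way the operation is metadata-only, so emptying a partition takes roughly constant time regardless of row count, instead of the batched `DELETE` running at ~40k rows/second.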

1 Answer

1 vote

The average rows/second depends on the specific case. To know whether it is enough, we would need to know how your final table is configured and what the hardware configuration of your database server is.

To improve performance, the first thing to look at is I/O: whether anything can be done on the write side or on the read side.

On the write side, you could use the SSIS component called "Balanced Data Distributor", which uses multiple processor threads to perform writes in parallel. You could also partition your table across multiple disks so that data can be written simultaneously.

On the read side, you could split the file into multiple files and read them simultaneously.

I don't know if this helps, but that is what I would look at.