I am building a simple SSIS package that truncates 10 tables and fills them up again. The source data is on one server and the destination is a SQL Server. The data passes through a third server where the SSIS package gets executed, since I do not have permission to execute SSIS packages on the destination SQL Server.

Question: should I create 10 separate data flow tasks, or one data flow task containing 10 source/destination pairs?

My reasoning is that since the data passes through the server which executes the SSIS package, it might use all the memory on that server if all the data is being processed simultaneously...

SSIS has a buffer of only 10 MB by default. Have a look at this article that explains SSIS memory usage: sqlshack.com/understanding-ssis-memory-usage - Jayvee

1 Answer


Create 11 packages.

If it's truly a truncate and reload, that's extract and load; no transformation is required. The amount of data in the data buffer (in memory) stays approximately constant for the whole run: SSIS does not allocate enough memory to hold all the source data at once. It extracts a batch of rows from the source, applies whatever transformations are specified (none in your case), and pushes the batch to the destination. If there's more data to be had, the same allocated memory is reused for the subsequent batches.
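To make the memory behavior concrete, here is a minimal sketch of the same streaming pattern outside SSIS, using Python and pyodbc: read one fixed-size batch, write it, then reuse the buffer for the next batch. The DSNs, table names, and columns are hypothetical placeholders, not anything from your environment.

```python
import pyodbc

BATCH_SIZE = 10_000  # analogous in spirit to DefaultBufferMaxRows in SSIS

src = pyodbc.connect("DSN=SourceServer")       # hypothetical source DSN
dst = pyodbc.connect("DSN=DestinationServer")  # hypothetical destination DSN

read = src.cursor()
write = dst.cursor()

write.execute("TRUNCATE TABLE dbo.TargetTable")  # truncate before reload
read.execute("SELECT col1, col2 FROM dbo.SourceTable")

while True:
    rows = read.fetchmany(BATCH_SIZE)  # only one batch held in memory at a time
    if not rows:
        break
    write.executemany(
        "INSERT INTO dbo.TargetTable (col1, col2) VALUES (?, ?)", rows
    )
    dst.commit()
```

Peak memory is bounded by the batch size regardless of how many rows the source table holds, which is the same reason a single SSIS data flow doesn't need to fit the whole table in RAM.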

Your orchestrator pattern then allows you to structure the execution however makes sense. If your concern is minimizing memory, then you likely want sequential execution. The options (see the sketch after this list):

  • You could run all packages in sequence - this would result in steady memory usage but a longer total duration.

  • You could run the packages all in parallel - shorter total duration but a much higher burst of memory usage.

  • My personal preference is a combination of parallel and sequential. I have 3, 5, however many Foreach Loop containers that all run in parallel. Each one enumerates through a segment of the work that I have partitioned beforehand (typically, a few big tables might each be the only member of a processing stream, while lots of little tables share another). The goal is for the parallel streams to carry roughly equivalent total workloads.
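Here is a hedged sketch of that hybrid pattern expressed in Python rather than in an SSIS orchestrator package: streams run in parallel, packages within a stream run in sequence, and each package is launched with the real dtexec command-line utility. The package paths and the split between big and small tables are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Segment the work beforehand: the big table gets its own stream,
# the small tables share one, so the streams finish at similar times.
streams = [
    [r"C:\etl\Load_BigTable.dtsx"],                       # hypothetical paths
    [r"C:\etl\Load_Small1.dtsx", r"C:\etl\Load_Small2.dtsx",
     r"C:\etl\Load_Small3.dtsx"],
]

def run_stream(packages):
    # Sequential within the stream keeps peak memory bounded.
    for pkg in packages:
        subprocess.run(["dtexec", "/F", pkg], check=True)

# Streams run in parallel, one worker per stream.
with ThreadPoolExecutor(max_workers=len(streams)) as pool:
    for future in [pool.submit(run_stream, s) for s in streams]:
        future.result()  # propagate any failure
```

In SSIS itself the equivalent is parallel Foreach Loop containers each executing its own list of child packages, but the trade-off is the same: the number of streams caps the memory burst, and balancing the streams caps the total duration.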