1 vote

I have built an SSIS package to load data from CSV files into a database table. The CSV files are first downloaded from Azure Blob Storage using a PowerShell script, and then each of these files is loaded into a target table in SQL Server.

So I set up a ForEach enumerator to loop through all the files and load the data into the target table, but the process is far too slow. Each file has just one row of data (around 30 columns), so to load, say, 20k rows I have to loop through 20k files, and the package takes hours to run.

I tried looking for alternative ways to load data from multiple files but couldn't find any concrete solution. Hilmar Buchta has an interesting approach that uses a Script Task to improve performance, but I don't have any C# know-how whatsoever.

Has anyone run into a similar problem and overcome it? If anyone has a sample for loading multiple files using a Script Task, that would help a lot.

Any help is appreciated.

Consider that keeping those files as they are is not smart. 20k files are a LOT of I/O operations - whatever you do, that is going to be a lot slower than loading 200 files with a lot of lines in each. – TomTom
Just the fact that you have a lot of very small files is a big problem. Not for SSIS - for any tool, including Hadoop. Either change your script to download a single file, or concatenate the small files into one big file. You don't explain what your package does, but if you run a full process inside the iterator you are also wasting a lot of time. Use the iterator to load all the data into a single staging table, then process the staged data in bulk. – Panagiotis Kanavos
Also consider that these small files cause a serious waste of space, since each of them will use a full disk page (4 KB). – Panagiotis Kanavos
I once had a similar problem and found that the two major performance losses were (1) opening all these files and (2) the fact that a new connection is created each time you import one of them. The first cannot be avoided, but the time can be reduced by making sure all the files are available locally and not on a network drive (or some other remote location). For the second part, I concatenated (in memory) a batch of approx. 500 files before sending it to the server for import - so, one connection every 500 files. That helped a lot too. – Ralph (a minimal sketch of this batching idea follows after the comments)
What about the MULTIFLATFILE Connection Manager mentioned in the same article by Hilmar Buchta? You do have CSVs, so it should work for you. – Y.B.
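
To make Ralph's batching suggestion concrete, here is a minimal sketch (my own adaptation, not Ralph's original code) applied to the file-concatenation approach rather than direct server inserts: it buffers roughly 500 files' worth of rows in memory and writes them in one call per batch instead of touching the output file once per source file. The folder and file names are placeholders, and it assumes every CSV shares the same header and holds one data row, as described in the question.

# Sketch of batched concatenation (placeholder paths; assumes identical headers
# and one data row per file, as in the question)
$batchSize = 500
$buffer    = New-Object System.Collections.Generic.List[string]

# Write the header once, taken from the first file found
$firstFile = Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File | Select-Object -First 1
Set-Content "COMBINED_FILE.csv" (Get-Content $firstFile.FullName -TotalCount 1)

Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File | ForEach-Object {
    # Skip each file's header and buffer its data rows in memory
    Get-Content $_.FullName | Select-Object -Skip 1 | ForEach-Object { $buffer.Add($_) }

    if ($buffer.Count -ge $batchSize) {
        Add-Content "COMBINED_FILE.csv" $buffer.ToArray()   # one write per ~500 files
        $buffer.Clear()
    }
}

# Flush whatever is left after the last full batch
if ($buffer.Count -gt 0) { Add-Content "COMBINED_FILE.csv" $buffer.ToArray() }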

2 Answers

1 vote

To conclude the conversation in the comments, here is the script from Merging multiple CSV files into one using PowerShell, which loads all the data in one go (assuming all the files have the same format), with a tiny tweak to traverse subfolders and append a carriage return to the end of each file:

# Start from a clean output file
if (Test-Path "COMBINED_FILE.csv") { Remove-Item "COMBINED_FILE.csv" }

$getFirstLine = $true

# -Recurse picks up .csv files in SOURCE_ROOT_FOLDER and all its subfolders
Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File | ForEach-Object {
    $filePath = $_.FullName

    $lines = Get-Content $filePath

    # Keep the header line only from the first file and skip it for the rest;
    # the appended NewLine element is the "tiny tweak" mentioned above,
    # adding a blank separator after each file's content
    $linesToWrite = @(switch ($getFirstLine) {
        $true  { $lines }
        $false { $lines | Select-Object -Skip 1 }
    }) + [System.Environment]::NewLine

    $getFirstLine = $false
    Add-Content "COMBINED_FILE.csv" $linesToWrite
}
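
As a quick sanity check (my addition, not part of the script above), the combined file should end up with one data row per source file plus a single header line, since each source file holds exactly one row of data:

# Optional check: one data row per source file plus one header line expected;
# the blank separator lines added by the tweak above are ignored here
$fileCount = (Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File).Count
$dataRows  = (Get-Content "COMBINED_FILE.csv" | Where-Object { $_ -ne "" }).Count - 1
"{0} source files -> {1} data rows in COMBINED_FILE.csv" -f $fileCount, $dataRows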
0 votes

I feel a little outwitted here. I deployed my package to Integration Services and scheduled a run for it via SQL Agent.

Guess what! A package that took 12 hours to load 6k files now loads 20k files in under 30 minutes. I would never have thought that executing a package in SSDT and executing it on the server would give such contrasting results.

I am not sure what the exact reason for this is, but I guess the time spent logging all the execution results in SSDT could have made a big difference.
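
One way to compare like for like outside the SSDT debugger would be to run the catalog-deployed package from the command line with dtexec (the folder, project, package and server names below are just placeholders, not my actual setup):

# Run the catalog-deployed package without the SSDT debugger attached;
# folder, project, package and server names are placeholders
dtexec /ISSERVER "\SSISDB\MyFolder\MyProject\MyPackage.dtsx" /SERVER "localhost"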

I will keep searching for the exact reason for this behaviour, but the significant decrease in execution time is acceptable to me, as I don't have a large number of files to load every day.

I would have gone for the file-concatenation option had I not needed the original files: we have now added a mail task to send files with errors (truncation/data) back to the dev team.

Thanks for the help though, everyone.