1 vote

I have built an SSIS package to load data from CSV files into a database table. The CSV files are first downloaded from Azure Blob Storage using a PowerShell script, and then each of these files is loaded into a target table in SQL Server.

So I set up a ForEach enumerator to loop through all the files and load the data into the target table, but the process is far too slow. Each file has just one row of data (around 30 columns), so to load, say, 20k rows I have to loop through 20k files, and the package takes hours to run.

I tried looking for alternative ways to load data from multiple files but couldn't find any concrete solution. Hilmar Buchta has an interesting approach that uses a Script Task to improve performance, but I don't have any C# know-how whatsoever.

Has anyone run into a similar problem and overcome it? If anyone has a sample for loading multiple files using a Script Task, that would help a lot.

Any help is appreciated.

Consider that keeping those files as they are is not smart. 20k files are a LOT of I/O operations - whatever you do, that is going to be a lot slower than loading 200 files with a lot of lines in each. – TomTom
Just the fact that you have a lot of very small files is a big problem. Not for SSIS - for any tool, including Hadoop. Either change your script to download a single file, or concatenate the small files into one big file. You don't explain what your package does, but if you run a full process inside the iterator you are also wasting a lot of time. Use the iterator to load all the data into a single staging table, then process the staged data in bulk. – Panagiotis Kanavos
Also consider that these small files cause a serious waste of space, since each of them will use a full disk page (4 KB). – Panagiotis Kanavos
I once had a similar problem and found that the two major performance losses were (1) opening all these files and (2) the fact that a new connection is created each time you import one of them. The first cannot be avoided, but the time can be reduced by making sure all the files are available locally and not on a network drive (or some other remote location). For the second part, I concatenated (in memory) a batch of approx. 500 files before sending it to the server for import - so, one connection every 500 files. That helped a lot too. – Ralph (a minimal sketch of this batching idea follows after the comments)
What about the MULTIFLATFILE Connection Manager mentioned in the same article by Hilmar Buchta? You do have CSVs, so it should work for you. – Y.B.
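
To make Ralph's batching suggestion concrete, here is a minimal sketch (my own adaptation, not Ralph's original code) applied to the file-concatenation approach rather than direct server inserts: it buffers roughly 500 files' worth of rows in memory and writes them in one call per batch instead of touching the output file once per source file. The folder and file names are placeholders, and it assumes every CSV shares the same header and holds one data row, as described in the question.

# Sketch of batched concatenation (placeholder paths; assumes identical headers
# and one data row per file, as in the question)
$batchSize = 500
$buffer    = New-Object System.Collections.Generic.List[string]

# Write the header once, taken from the first file found
$firstFile = Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File | Select-Object -First 1
Set-Content "COMBINED_FILE.csv" (Get-Content $firstFile.FullName -TotalCount 1)

Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File | ForEach-Object {
    # Skip each file's header and buffer its data rows in memory
    Get-Content $_.FullName | Select-Object -Skip 1 | ForEach-Object { $buffer.Add($_) }

    if ($buffer.Count -ge $batchSize) {
        Add-Content "COMBINED_FILE.csv" $buffer.ToArray()   # one write per ~500 files
        $buffer.Clear()
    }
}

# Flush whatever is left after the last full batch
if ($buffer.Count -gt 0) { Add-Content "COMBINED_FILE.csv" $buffer.ToArray() }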

2 Answers

1 vote

To conclude the conversation in the comments, here is the script from Merging multiple CSV files into one using PowerShell, which loads all the data in one go (assuming all the files have the same format), with a tiny tweak to traverse subfolders and append a carriage return to the end of each file:

# Start from a clean output file
if (Test-Path "COMBINED_FILE.csv") { Remove-Item "COMBINED_FILE.csv" }

$getFirstLine = $true

# -Recurse picks up .csv files in SOURCE_ROOT_FOLDER and all its subfolders
Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File | ForEach-Object {
    $filePath = $_.FullName

    $lines = Get-Content $filePath

    # Keep the header line only from the first file and skip it for the rest;
    # the appended NewLine element is the "tiny tweak" mentioned above,
    # adding a blank separator after each file's content
    $linesToWrite = @(switch ($getFirstLine) {
        $true  { $lines }
        $false { $lines | Select-Object -Skip 1 }
    }) + [System.Environment]::NewLine

    $getFirstLine = $false
    Add-Content "COMBINED_FILE.csv" $linesToWrite
}
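
As a quick sanity check (my addition, not part of the script above), the combined file should end up with one data row per source file plus a single header line, since each source file holds exactly one row of data:

# Optional check: one data row per source file plus one header line expected;
# the blank separator lines added by the tweak above are ignored here
$fileCount = (Get-ChildItem "SOURCE_ROOT_FOLDER\*.csv" -Recurse -File).Count
$dataRows  = (Get-Content "COMBINED_FILE.csv" | Where-Object { $_ -ne "" }).Count - 1
"{0} source files -> {1} data rows in COMBINED_FILE.csv" -f $fileCount, $dataRows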
0 votes

I feel a little outwitted here. I deployed my package to Integration Services and scheduled a run for it via SQL Agent.

Guess what! A package that took 12 hours to load 6k files now loads 20k files in under 30 minutes. I would never have thought that executing a package in SSDT and executing it on the server would give such contrasting results.

I am not sure what the exact reason for this is, but I guess the time spent logging all the execution results in SSDT could have made a big difference.
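
One way to compare like for like outside the SSDT debugger would be to run the catalog-deployed package from the command line with dtexec (the folder, project, package and server names below are just placeholders, not my actual setup):

# Run the catalog-deployed package without the SSDT debugger attached;
# folder, project, package and server names are placeholders
dtexec /ISSERVER "\SSISDB\MyFolder\MyProject\MyPackage.dtsx" /SERVER "localhost"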

I will keep searching for the exact reason for this behaviour, but the significant decrease in execution time is acceptable to me, as I don't have a large number of files to load every day.

I would have gone for the file-concatenation option had I not needed the original files: we have now added a mail task to send files with errors (truncation/data) back to the dev team.

Thanks for the help though, everyone.