0
votes

I need to copy files depending on their content. I enumerate all files, read each file's content, and test it against a regex. If it matches, I want to copy the file to a certain directory. My problem is that there are a lot of source files, so I need to do this in parallel.

I cannot use the PowerShell ForEach-Object -Parallel feature because we are using a PowerShell version < 7.0. Using a workflow is way too slow.

$folder = "C:\InputFiles"
workflow CopyFiles
{
    foreach -parallel ($file in gci $folder *.* -rec | where { ! $_.PSIsContainer })
    {
        # Get content and compare against a regex
        # Copy if regex matches
    }
}
CopyFiles

Any ideas how to run this in parallel with PowerShell?

2
Is the regex to match the same for all files? - Kundan
There are two regexes. If either the first or the second matches, the file needs to be copied. - Lori
If jobs and workflows are still too slow for you, you might want to switch away from an interpreted language (PowerShell) to a more low-level implementation in C/C++. Besides that, did you investigate what is slowing you down? It might be the part where you get the content and compare it. Is it necessary to read the whole file? And how do you read the content? Get-Content will be a bad choice if you want the highest speed. - stackprotector
Thanks for the feedback. Yes, in C# it is about 5 lines of code, and the execution time in my case is ~30 seconds. I have not yet investigated what makes it slow, but I am in fact using Get-Content. Yes, I need to read the whole file, but it would also be possible to read it line by line and stop as soon as a regex matches, continuing to the last line only if nothing matches. Do you have an example for that? - Lori

2 Answers

0
votes

Another option is using jobs. You'd have to define a ScriptBlock accepting the path and regex as parameters, then run it in parallel in the background. Read about the Start-Job, Receive-Job, Get-Job, and Remove-Job cmdlets. But I don't think it's really going to help:

  • I don't expect it to be much faster than workflows
  • You'd have to throttle and control execution of the jobs yourself, which adds complexity to the script
  • There's substantial overhead to running jobs
  • Most probably the file system is the bottleneck of this task, so any approach that accesses files in parallel isn't really going to help here
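
For reference, the job-based approach described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name Copy-MatchingFileWithJobs, its parameters, and the default batch size and throttle limit are all hypothetical, and the content check uses Select-String -List rather than Get-Content. Files are handed to jobs in batches because starting one job per file would multiply the per-job overhead mentioned above.

```powershell
# Hypothetical helper: names, defaults, and batching strategy are assumptions.
function Copy-MatchingFileWithJobs {
    param(
        [string]$Source,
        [string]$Destination,
        [string[]]$Patterns,
        [int]$MaxJobs = 4,       # assumed throttle limit
        [int]$BatchSize = 500    # assumed number of files per job
    )

    # Runs inside each background job: copy every file that matches any pattern.
    $worker = {
        param([string[]]$Paths, [string[]]$Patterns, [string]$Destination)
        foreach ($path in $Paths) {
            # -List returns only the first match per file, so scanning stops early.
            if (Select-String -LiteralPath $path -Pattern $Patterns -List) {
                Copy-Item -LiteralPath $path -Destination $Destination
                $path   # emit the copied path so the caller can report it
            }
        }
    }

    # Jobs start in a different working directory, so pass absolute paths only.
    $Destination = (New-Item -ItemType Directory -Path $Destination -Force).FullName
    $files = @(Get-ChildItem -Path $Source -Recurse |
        Where-Object { -not $_.PSIsContainer } |
        ForEach-Object { $_.FullName })

    $jobs = @()
    for ($i = 0; $i -lt $files.Count; $i += $BatchSize) {
        # Throttle: wait while $MaxJobs of our own jobs are pending or running.
        while (@($jobs | Where-Object { $_.State -in 'NotStarted', 'Running' }).Count -ge $MaxJobs) {
            Start-Sleep -Milliseconds 200
        }
        $last  = [Math]::Min($i + $BatchSize, $files.Count) - 1
        $batch = $files[$i..$last]
        $jobs += Start-Job -ScriptBlock $worker -ArgumentList $batch, $Patterns, $Destination
    }

    if ($jobs.Count -gt 0) {
        $jobs | Wait-Job | Receive-Job   # outputs the paths copied by all jobs
        $jobs | Remove-Job
    }
}
```

It would be called along the lines of Copy-MatchingFileWithJobs -Source 'C:\InputFiles' -Destination 'C:\MatchedFiles' -Patterns 'first-regex', 'second-regex', where the destination folder and both patterns are placeholders. Note that files with the same name in different subfolders would overwrite each other in the destination, and the destination should not be located inside the source folder.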
0
votes

Can you run the following script with your configuration and see how much time this method takes? For me, it takes about 100 ms to find around 2000 occurrences of the text PowerShell in those files.

$starttime = Get-Date
$RegEx = 'Powershell'
$FilesFound = Get-ChildItem -Path "$PSHOME\en-US\*.txt" | Select-String -Pattern $RegEx
Write-Host "Total occurrences found: $($FilesFound.Count)"
$endtime = Get-Date

# TotalMilliseconds is the full elapsed time; Milliseconds is only the 0-999 component.
Write-Host "Time of execution:" ($endtime - $starttime).TotalMilliseconds "milliseconds"
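
If that turns out to be fast enough, the same pipeline could be extended to do the actual copy without any parallelism. The following is a minimal sketch, not a tested solution for your data: the function name Copy-MatchingFile and its parameters are hypothetical, and the patterns and destination you pass in are yours to choose. Select-String accepts several patterns and reports a file if any of them matches, and the -List switch returns only the first match per file, so a file is not scanned further once it has matched.

```powershell
# Hypothetical helper: the name and parameters are assumptions for illustration.
function Copy-MatchingFile {
    param(
        [string]$Source,
        [string]$Destination,
        [string[]]$Patterns   # a file qualifies if ANY of these regexes matches
    )

    $null = New-Item -ItemType Directory -Path $Destination -Force

    Get-ChildItem -Path $Source -Recurse |
        Where-Object { -not $_.PSIsContainer } |
        # -List emits at most one MatchInfo per file (its first match).
        Select-String -Pattern $Patterns -List |
        ForEach-Object {
            Copy-Item -LiteralPath $_.Path -Destination $Destination
            $_.Path   # emit the path of each copied file
        }
}
```

A call might look like Copy-MatchingFile -Source 'C:\InputFiles' -Destination 'C:\MatchedFiles' -Patterns 'first-regex', 'second-regex', with the destination and both patterns being placeholders. Keep the destination outside the source folder, and be aware that equally named files from different subfolders would overwrite each other.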