0 votes

What is the recommended method/approach to check the "file originality" (integrity) of large quantities of files in a directory?

When transferring large quantities of files from one site to another, there is a possibility that files become corrupted or are modified without authorization during the transfer process.

Currently I am using the Last Modified Date to verify whether each file is still the "original" copy. I have found that a file checksum (MD5/SHA1) may be a better approach than checking the Last Modified Date.
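
Roughly, the date-based check I do today looks like this (just a sketch; the source and destination folder paths are placeholders):

#compare the LastWriteTime of each transferred file against the source copy
Get-ChildItem -Path '\\source\share' -File | ForEach-Object {
    $dest = Join-Path 'D:\received' $_.Name
    if ((Get-Item $dest).LastWriteTime -ne $_.LastWriteTime) {
        Write-Warning "$($_.Name) may have been modified"
    }
}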

  1. Is using a file's MD5 hash the best approach/method to check/verify the files (a sketch of the comparison I have in mind follows the reference link below), or is there a better alternative?
  2. How about the performance side? Is it CPU intensive? Is generating MD5/SHA1 hashes efficient and quick enough to process large quantities of files? Does the size of a file affect the time taken to generate its MD5 hash?

Reference: https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/get-filehash?view=powershell-6
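
To be concrete, this is the kind of checksum comparison I have in mind, based on the Get-FileHash documentation above (again just a sketch; the folder paths are placeholders):

#hash both sides and report any file whose content differs
$sourceHashes = Get-ChildItem '\\source\share' -File | Get-FileHash -Algorithm MD5
$copyHashes   = Get-ChildItem 'D:\received' -File | Get-FileHash -Algorithm MD5
Compare-Object $sourceHashes $copyHashes -Property Hash -PassThru | Select-Object Path, Hash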

You sort of answered your own question: yes, hashing takes CPU time, and that time is proportional to the file length. Whether that is acceptable only you can answer; run a representative test on your hardware. – bommelding
"1. How about the performance side? 2.Cpu intensive?", 1 depends on the file size and the Cpu. 2. Well it's either Cpu intensive or what? If you have a metric ton of little file for parallelisation, perhaps GPU.. Or do you have some Hashing Asic?Drag and Drop

1 Answer

1 vote

The Last Modified Date can be altered at will to try to mask file manipulation, so it is not a reliable integrity check; see HeyScriptingGuy for an example. You can, on the other hand, hash files with fairly minimal computational power, as the benchmark below shows.
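
To illustrate how easily the timestamp can be reset, here is a one-line sketch (the file path is a placeholder):

#set the LastWriteTime of an already-modified file back to an arbitrary date;
#a date-based check would then treat the file as untouched
(Get-Item 'C:\transfer\report.txt').LastWriteTime = Get-Date '2017-01-01'

The benchmark consisted of the following steps.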

Generate 100 files of 10 MB each.

Add-Type -AssemblyName System.Web
1..100 | %{
    #get a random filename in the present working directory
    $fn = [System.IO.Path]::Combine($pwd, [GUID]::NewGuid().ToString("N") + '.txt')
    #set number of iterations
    $count = 10mb/128
    #create a filestream
    $fs = New-Object System.IO.FileStream($fn,[System.IO.FileMode]::CreateNew)
    #create a streamwriter
    $sw = New-Object System.IO.StreamWriter($fs,[System.Text.Encoding]::ASCII,128)
    do{
         #Write the chars plus eol
         $sw.WriteLine([System.Web.Security.Membership]::GeneratePassword(126,0))
         #decrement the counter
         $count--
    }while($count -gt 0)
    #close the streamwriter
    $sw.Close()
    #close the filestream
    $fs.Close()
}

Create an array of supported algorithms. This list is for PowerShell v4, v5, and v5.1; v6 dropped MACTripleDES and RIPEMD160.

$algorithms = @('MACTripleDES','MD5','RIPEMD160','SHA1','SHA256','SHA384','SHA512')
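
If you are not sure which algorithms your version of Get-FileHash accepts, you can read them from the parameter metadata instead of hard-coding the list (a sketch; assumes -Algorithm is declared with a ValidateSet, as it is in the versions above):

#list the values accepted by Get-FileHash -Algorithm on this machine
(Get-Command Get-FileHash).Parameters['Algorithm'].Attributes |
    Where-Object { $_ -is [System.Management.Automation.ValidateSetAttribute] } |
    Select-Object -ExpandProperty ValidValues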

Hash all the files 10 times for each algorithm and get the average.

foreach($algorithm in $algorithms) {
    $totalTime = 0
    1..10 | ForEach-Object {
        #time one pass of hashing every file in the current directory
        $totalTime += Measure-Command {Get-ChildItem | Get-FileHash -Algorithm $algorithm} | Select-Object -ExpandProperty TotalSeconds
    }
    [PSCustomObject]@{
        Algorithm   = $algorithm
        AverageTime = $totalTime/10
    }
}

Results for 100 files of 10 MB each (average time in seconds):

Algorithm    AverageTime
---------    -----------
MACTripleDES 42.44803523
MD5           3.50319849
RIPEMD160     9.58865946
SHA1          3.94368417
SHA256        7.72123966
SHA384        5.61478894
SHA512        5.62492335

Results for 10 files of 100 MB each (average time in seconds):

Algorithm    AverageTime
---------    -----------
MACTripleDES 43.82652673
MD5           3.40958188
RIPEMD160     9.25260835
SHA1          3.74736209
SHA256        7.19778535
SHA384        5.17364897
SHA512        5.17803741

I would recommend running a similar benchmark on your systems to see what the impact would be. Also from the documentation: "For security reasons, MD5 and SHA1, which are no longer considered secure, should only be used for simple change validation, and should not be used to generate hash values for files that require protection from attack or tampering."
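
Putting it together: if you need protection against deliberate tampering rather than just accidental corruption, use SHA256 (or stronger) and compare a hash manifest generated at the source against one generated after the transfer. A minimal sketch (the folder and manifest paths are placeholders):

#at the source: hash every file and write a manifest
Get-ChildItem 'C:\outbound' -File |
    Get-FileHash -Algorithm SHA256 |
    Select-Object @{n='Name';e={Split-Path $_.Path -Leaf}}, Hash |
    Export-Csv 'C:\manifest.csv' -NoTypeInformation

#at the destination: recompute each hash and compare it with the shipped manifest
foreach ($entry in Import-Csv 'D:\manifest.csv') {
    $file = Join-Path 'D:\inbound' $entry.Name
    if ((Get-FileHash $file -Algorithm SHA256).Hash -ne $entry.Hash) {
        Write-Warning "Integrity check failed for $file"
    }
}

Note that for tamper protection the manifest itself has to travel over a trusted channel (or be signed); an attacker who can modify the files in transit can usually modify an accompanying manifest as well.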