16
votes

Previously I asked a question about combining SHA1+MD5, but I have since learned that calculating SHA1 and then MD5 of a large file is not much faster than SHA256. In my case a 4.6 GB file takes about 10 minutes with the default SHA256 implementation (C#, Mono) on a Linux system.

public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

Then I read this topic and changed my code according to what it suggested:

public static string GetChecksumBuffered(Stream stream)
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

But it doesn't make much difference and still takes about 9 minutes.

Then I tested the same file with the sha256sum command on Linux. It takes about 28 seconds, and both the code above and the Linux command give the same result!

Someone advised me to read about the differences between a hash code and a checksum, and I found this topic that explains them.

My questions are:

  1. What causes such a difference in time between the code above and Linux sha256sum?

  2. What does the code above do? (I mean, is it a hash code calculation or a checksum calculation? If you search for how to get the hash code of a file and how to get the checksum of a file in C#, both lead to the code above.)

  3. Is there any practical attack against sha256sum even though SHA256 is collision resistant?

  4. How can I make my C# implementation as fast as sha256sum?

3
Is there a reason you can't call sha256sum from your code using a Process? – Nate Diamond
@NateDiamond Yeah! First, this program must run on Windows as well as Linux. Second, as I mention in my question, I don't know whether a checksum (or a hash code) is secure enough. – Mohammad Sina Karvandi
This should be completely throttled by the cost of reading the file off the disk. 9 minutes is not unthinkable but you'd need a cheap laptop with a crappy spindle drive and not enough RAM. Document what you use. – Hans Passant
@HansPassant Actually, what I want to know is whether there is any difference between the hash code of a file and the checksum of a file. – Mohammad Sina Karvandi

3 Answers

16
votes
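public string SHA256CheckSum(string filePath)
{
    // SHA256.Create() returns the platform's default SHA-256 implementation.
    using (SHA256 sha256 = SHA256.Create())
    // Both the hasher and the file stream are disposed when the method returns.
    using (FileStream fileStream = File.OpenRead(filePath))
    {
        // Note: the hash is returned as Base64, not as the hex string sha256sum prints.
        return Convert.ToBase64String(sha256.ComputeHash(fileStream));
    }
}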
10
votes
  1. My best guess is that there's some additional buffering in the Mono implementation of the File.Read operation. Having recently looked into checksums on a large file, on a decent spec Windows machine you should expect roughly 6 seconds per GB if all is running smoothly.

    Oddly it has been reported in more than one benchmark test that SHA-512 is noticeably quicker than SHA-256 (see 3 below). One other possibility is that the problem is not in allocating the data, but in disposing of the bytes once read. You may be able to use TransformBlock (and TransformFinalBlock) on a single array rather than reading the stream in one big gulp—I have no idea if this will work, but it bears investigating.

  2. The difference between hashcode and checksum is (nearly) semantics. They both calculate a shorter 'magic' number that is fairly unique to the data in the input, though if you have 4.6GB of input and 64B of output, 'fairly' is somewhat limited.

    • A checksum is not secure, and with a bit of work you can figure out the input from enough outputs, work backwards from output to input and do all sorts of insecure things.
    • A Cryptographic hash takes longer to calculate, but changing just one bit in the input will radically change the output and for a good hash (e.g. SHA-512) there's no known way of getting from output back to input.
  3. MD5 is broken for collision resistance: two different inputs with the same MD5 hash can be fabricated on an ordinary PC, although producing an input for an arbitrary given output is still not practical. SHA-256 is (probably) still secure, but may not stay that way; if your project has a lifespan measured in decades, then assume you'll need to change it. SHA-512 has no known practical attacks and probably won't for quite a while, and since it is often quicker than SHA-256 on 64-bit hardware I'd recommend it anyway. Benchmarks show it takes about 3 times longer to calculate SHA-512 than MD5, so if your speed issue can be dealt with, it's the way to go.

  4. No idea, beyond those mentioned above. You're doing it right.
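
The TransformBlock idea from point 1 could be sketched roughly as follows. This is only a minimal sketch, not a tested fix for the Mono slowdown: the class and method names are hypothetical, and the 1 MB buffer size and the FileOptions.SequentialScan hint are assumptions to experiment with rather than measured optimums.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class IncrementalSha256
{
    // Hypothetical helper. bufferSize is an assumption to tune, not a known best value.
    public static string GetChecksumChunked(string file, int bufferSize = 1024 * 1024)
    {
        using (SHA256 sha = SHA256.Create())
        using (FileStream stream = new FileStream(file, FileMode.Open, FileAccess.Read,
                                                  FileShare.Read, bufferSize,
                                                  FileOptions.SequentialScan))
        {
            byte[] buffer = new byte[bufferSize];
            int read;

            // Feed the file to the hash in large chunks, reusing one array.
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                sha.TransformBlock(buffer, 0, read, null, 0);
            }

            // An empty final block finishes the hash; the result is then in sha.Hash.
            sha.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(sha.Hash).Replace("-", String.Empty);
        }
    }
}
```

The output should match what ComputeHash(Stream) produces for the same file; whether it is any faster on a given runtime has to be measured.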

For a bit of light reading, see Crypto.SE: SHA512 is faster than SHA256?
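
If you want to check the SHA-256 versus SHA-512 speed claim on your own machine, a rough timing sketch like the one below may help. The class name, the 64 MB in-memory buffer and the round count are arbitrary assumptions, and a Stopwatch loop like this is only a crude comparison, not a rigorous benchmark. Results will vary with the CPU and the runtime.

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;

public static class HashTiming
{
    // Hypothetical helper: times repeated hashing of an in-memory buffer.
    public static TimeSpan Time(HashAlgorithm algorithm, byte[] data, int rounds)
    {
        algorithm.ComputeHash(data); // warm-up call, not timed

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < rounds; i++)
        {
            algorithm.ComputeHash(data);
        }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void PrintComparison()
    {
        // In-memory data keeps disk speed out of the measurement.
        byte[] data = new byte[64 * 1024 * 1024];
        new Random(1).NextBytes(data);

        using (SHA256 sha256 = SHA256.Create())
        using (SHA512 sha512 = SHA512.Create())
        {
            Console.WriteLine("SHA-256: " + Time(sha256, data, 5));
            Console.WriteLine("SHA-512: " + Time(sha512, data, 5));
        }
    }
}
```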

Edit in response to question in comment

The purpose of a checksum is to allow you to check if a file has changed between the time you originally wrote it, and the time you come to use it. It does this by producing a small value (512 bits in the case of SHA512) where every bit of the original file contributes at least something to the output value. The purpose of a hashcode is the same, with the addition that it is really, really difficult for anyone else to get the same output value by making carefully managed changes to the file.

The premise is that if the checksums are the same at the start and when you check it, then the files are the same, and if they're different the file has certainly changed. What you are doing above is feeding the file, in its entirety, through an algorithm that rolls, folds and spindles the bits it reads to produce the small value.

As an example: in the application I'm currently writing, I need to know if parts of a file of any size have changed. I split the file into 16K blocks, take the SHA-512 hash of each block, and store it in a separate database on another drive. When I come to see if the file has changed, I reproduce the hash for each block and compare it to the original. Since I'm using SHA-512, the chances of a changed file having the same hash are unimaginably small, so I can be confident of detecting changes in 100s of GB of data whilst only storing a few MB of hashes in my database. I'm copying the file at the same time as taking the hash, and the process is entirely disk-bound; it takes about 5 minutes to transfer a file to a USB drive, of which 10 seconds is probably related to hashing.
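
The block-by-block scheme described above might look something like this. It is a sketch under assumptions, not my actual application code: the class and method names are hypothetical, and the hashes are kept in an in-memory list rather than in a separate database.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class BlockHasher
{
    public const int BlockSize = 16 * 1024; // 16K blocks, as in the description above

    // Returns one SHA-512 hash per 16K block of the stream.
    public static List<byte[]> HashBlocks(Stream stream)
    {
        var hashes = new List<byte[]>();
        byte[] block = new byte[BlockSize];
        using (SHA512 sha = SHA512.Create())
        {
            int filled;
            while ((filled = FillBlock(stream, block)) > 0)
            {
                hashes.Add(sha.ComputeHash(block, 0, filled));
            }
        }
        return hashes;
    }

    // Stream.Read may return fewer bytes than requested, so loop until the
    // block is full or the stream ends.
    private static int FillBlock(Stream stream, byte[] block)
    {
        int total = 0;
        int read;
        while (total < block.Length &&
               (read = stream.Read(block, total, block.Length - total)) > 0)
        {
            total += read;
        }
        return total;
    }

    // Compares two hash lists and returns the indexes of blocks that differ,
    // including blocks that exist in only one of the lists.
    public static List<int> ChangedBlocks(List<byte[]> before, List<byte[]> after)
    {
        var changed = new List<int>();
        int count = Math.Max(before.Count, after.Count);
        for (int i = 0; i < count; i++)
        {
            bool same = i < before.Count && i < after.Count &&
                        before[i].SequenceEqual(after[i]);
            if (!same) changed.Add(i);
        }
        return changed;
    }
}
```

Each SHA-512 hash is 64 bytes, so the stored hashes come to 1/256 of the size of the data they cover.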

Lack of disk space to store hashes is a problem I can't solve in a post—buy a USB stick?

3
votes

Way late to the party but seeing as none of the answers mentioned it, I wanted to point out:

SHA256Managed derives from the System.Security.Cryptography.HashAlgorithm class, and all of the functionality related to the read operations is handled in the inherited code.

HashAlgorithm.ComputeHash(Stream) uses a fixed 4096 byte buffer to read data from a stream. As a result, you're not really going to see much difference using a BufferedStream for this call.

HashAlgorithm.ComputeHash(byte[]) operates on the entire byte array, but it resets the internal state after every call, so it can't be used to incrementally hash a buffered stream.
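
To illustrate the reset behaviour, here is a small sketch (the class and method names are hypothetical). The first method shows that calling ComputeHash(byte[]) chunk by chunk does not give the hash of the whole input. The second uses IncrementalHash, which I believe is available in .NET Framework 4.6.2 and later and in .NET Core; whether your Mono version ships it is an assumption you would need to check.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class ResetDemo
{
    private static readonly byte[] Part1 = Encoding.ASCII.GetBytes("hello ");
    private static readonly byte[] Part2 = Encoding.ASCII.GetBytes("world");

    // Returns false: each ComputeHash call starts from a fresh state, so the
    // second call hashes "world" alone, not "hello world".
    public static bool ChunkedComputeHashMatchesWhole()
    {
        byte[] whole = Part1.Concat(Part2).ToArray();
        using (SHA256 sha = SHA256.Create())
        {
            byte[] expected = sha.ComputeHash(whole);
            sha.ComputeHash(Part1);
            byte[] last = sha.ComputeHash(Part2);
            return expected.SequenceEqual(last);
        }
    }

    // Returns true: AppendData accumulates state across calls until
    // GetHashAndReset is called.
    public static bool IncrementalHashMatchesWhole()
    {
        byte[] whole = Part1.Concat(Part2).ToArray();
        using (SHA256 sha = SHA256.Create())
        using (IncrementalHash hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
        {
            byte[] expected = sha.ComputeHash(whole);
            hasher.AppendData(Part1);
            hasher.AppendData(Part2);
            return expected.SequenceEqual(hasher.GetHashAndReset());
        }
    }
}
```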

Your best bet would be to use a third party implementation that's optimized for your use case.