When is it appropriate to use CRC for error detection versus more modern hashing functions such as MD5 or SHA1? Is the former easier to implement on embedded hardware?
13 Answers
CRC works fine for detecting random errors in data that might occur, for example, from network interference, line noise, distortion, etc.
CRC is computationally much less complex than MD5 or SHA1. Using a hash function like MD5 is probably overkill for random error detection. However, using CRC for any kind of security check would be much less secure than a more complex hashing function such as MD5.
And yes, CRC is much easier to implement on embedded hardware, you can even get different packaged solutions for this on IC.
CRC is designed against unintentional changes in the data. That is, it's good for detecting unintentional errors, but will be useless as a way of making sure a data was not maliciously handled.
Also see this.
I found a study that shows how inappropriate CRC hashes are for hash tables. It also explains the actual characteristics of the algorithm. The study also includes evaluation of other hash algorithms and is a good reference to keep.
UPDATE
It seems the site is down. The internet archive has a copy though.
UPDATE 2
Oh dear. It turns out the study may have been faulty around the conclusions on CRC for use as a hash. Thanks @minexew for the link.
I ran every line of this PHP code in 1.000.000 loop. Results are in comments (#).
hash('crc32', 'The quick brown fox jumped over the lazy dog.');# 750ms 8 chars
hash('crc32b','The quick brown fox jumped over the lazy dog.');# 700ms 8 chars
hash('md5', 'The quick brown fox jumped over the lazy dog.');# 770ms 32 chars
hash('sha1', 'The quick brown fox jumped over the lazy dog.');# 880ms 40 chars
hash('sha256','The quick brown fox jumped over the lazy dog.');# 1490ms 64 chars
hash('sha384','The quick brown fox jumped over the lazy dog.');# 1830ms 96 chars
hash('sha512','The quick brown fox jumped over the lazy dog.');# 1870ms 128 chars
My conclusion:
- Use "crc32b" when you need http://en.wikipedia.org/wiki/Cyclic_redundancy_check and you do not care about security.
Use "sha256" (or higher) when you need added security layer.
Do not use "md5" or "sha1" because they have:
- some security issues when you care about security
- longer hash string and are slower than "crc32b" when all you need is CRC
For CRC information on implementation, speed and reliability see A painless guide to CRC error detection algorithms. It has everything on CRCs.
Unless somebody is going to try and modify your data maliciously and hide the change CRC is sufficient. Just use a "Good" (standard) polinomial.
It all depends on your requirements and expectation.
Here are quick brief differences between these hash function algorithms:
CRC (CRC-8/16/32/64)
- is not a cryptographic hashing algorithm (it's using a linear function based on cyclic redundancy checks)
- can produce either 9, 17, 33 or 65 bits
- not intended to be used for cryptographic purposes since makes no cryptographic guarantees,
- unsuitable for use in digital signatures, because it's easily reversible2006,
- should not be used for encryption purposes,
- different strings can generate the collision,
- invented in 1961 and used in Ethernet and many other standards,
MD5
- is a cryptographic hash algorithm,
- producing a 128-bit (16-byte) hash value (32 digit hexadecimal numbers)
- it is a cryptographic hash, but is considered deprecated if you worry about security,
- there are known strings which have the same MD5 hash value
- can be used for encryption purposes,
SHA-1
is a cryptographic hash algorithm,
produces a 160-bit (20-byte) hash value known as a message digest
it is a cryptographic hash and since 2005 it's no longer considered secure,
can be used for encryption purposes,
first published in 1993 (as SHA-0), then 1995 as SHA-1,
series: SHA-0, SHA-1, SHA-2, SHA-3,
In summary, using SHA-1 is no longer considered secure against well-funded opponents, because in 2005, cryptanalysts found attacks on SHA-1 which suggests it may be not secure enough for ongoing useschneier. U.S. NIST advise that federal agencies should stop using SHA1-1 for application which require collision resistance and must use SHA-2 after 2010NIST.
Therefore, if you're looking for simple and quick solution for checking the integrity of a files (against the corruption), or for some simple caching purposes in terms of performance, you can consider CRC-32, for hashing you may consider to use MD5, however if you're developing professional application (which should be secure and consistent), to avoid any collision probabilities - use SHA-2 and above (such as SHA-3).
Performance
Some simple benchmark test in PHP:
# Testing static text.
$ time php -r 'for ($i=0;$i<1000000;$i++) crc32("foo");'
real 0m0.845s
user 0m0.830s
sys 0m0.008s
$ time php -r 'for ($i=0;$i<1000000;$i++) md5("foo");'
real 0m1.103s
user 0m1.089s
sys 0m0.009s
$ time php -r 'for ($i=0;$i<1000000;$i++) sha1("foo");'
real 0m1.132s
user 0m1.116s
sys 0m0.010s
# Testing random number.
$ time php -r 'for ($i=0;$i<1000000;$i++) crc32(rand(0,$i));'
real 0m1.754s
user 0m1.735s
sys 0m0.012s\
$ time php -r 'for ($i=0;$i<1000000;$i++) md5(rand(0,$i));'
real 0m2.065s
user 0m2.042s
sys 0m0.015s
$ time php -r 'for ($i=0;$i<1000000;$i++) sha1(rand(0,$i));'
real 0m2.050s
user 0m2.021s
sys 0m0.015s
Related:
You do not say what it is that you are trying to protect.
A CRC is often used in embedded systems as a check against accidental data corruption as opposed to preventing malicious system modification. Examples of the places where a CRC can be useful is to validate an EPROM image during system initialisation to guard against firmware corruption. The system bootloader will calculate the CRC for the application code and compare with the stored value before allowing the code to run. This protects against the possibility of accidental program corruption or a failed download.
A CRC can also be used in a similar manner to protect configuration data stored in FLASH or EEPROM. If the CRC is incorrect then the data can be flagged as invalid and a default or backup data set used. The CRC may be invalid due to device failure or if the user removed power during an update of the configuration data store.
There have been comments that a hash provides greater probability of detecting corruption than a CRC with multiple bit errors. This is true, and the decision on whether or not to use a 16 or 32 bit CRC will hinge upon the safety consequences of a corrupted data block being used and whether you can justify the 1 in 2^16 or 2^32 chance of a data block being incorrectly declared valid.
Many devices have a built in CRC generator for standard algorithms. The MSP430F5X series from Texas have a hardware implementation of the CRC-CCITT Standard.
I came across a use of CRC recently which was smart. The author of the jdupe file duplication identification and removal tool (the same author of the popular exif tool jhead) uses it during the first pass through the files. A CRC is computed on the first 32K of each file to mark files that appear to be the same, also the files must have the same size. These files are added to a list of files on which to do a full binary comparison. It speeds up checking large media files.
Only use CRC if computation resources are very tight (i.e. some embed environments) or you need to store/transport many output values and space/bandwidth is tight (as CRCs are usually 32-bit where an MD5 output is 128-bit, SHA1 160 bit, and other SHA variants up to 512 bit).
Never use CRC for security checks as a CRC is very easy to "fake".
Even for accidental error detection (rather than malicious change detection) hashes are better than a simple CRC. Partly because of the simple way a CRC is calculated (and partly because CRC values are usual shorter than common hash outputs so have a much smaller range of possible values) it is much more likely that, in a situation where there are two or more errors, one error will mask another so you end up with the same CRC despite two errors.
In short: unless you have reason not to use a decent hash algorithm, avoid simple CRCs.
Lets start with the basics.
In Cryptography, a hashing algorithm converts many bits to fewer bits through a digest operation. Hashes are used to confirm integrity of messages and files.
All hashing algorithms generate collisions. A collision is when several many-bit combinations produce the same fewer bit output. The cryptographic strength of a hashing algorithm is defined by the inability for an individual to determine what the output is going to be for a given input because if they could they could construct a file with a hash that matches a legitimate file and compromise the assumed integrity of the system. The difference between CRC32 and MD5 is that MD5 generates a larger hash that's harder to predict.
When you want to implement message integrity - meaning the message hasn't been tampered with in transit - the inability to predict collisions is an important property. A 32-bit hash can describe 4 billion different messages or files using 4 billion different unique hashes. If you have 4 billion and 1 files, you are guaranteed to have 1 collision. 1 TB Bitspace has the possibility for Billions of Collisions. If I'm an attacker and I can predict what that 32 bit hash is going to be, I can construct an infected file that collides with the target file; that has the same hash.
Additionally if I'm doing 10mbps transmission then the possibility of a packet getting corrupted just right to bypass crc32 and continue along the to the destination and execute is very low. Lets say at 10mbps I get 10 errors\second. If I ramp that up to 1gbps, now I'm getting 1,000 errors per second. If I ram up to 1 exabit per second, then I have an error rate of 1,000,000,000 errors per second. Say we have a collision rate of 1\1,000,000 transmission errors, Meaning 1 in a million transmission errors results in the corrupt data getting through undetected. At 10mbps I'd get error data being sent every 100,000 seconds or about once a day. At 1gbps it'd happen once every 5 minutes. At 1 exabit per second, we're talking several times a second.
If you pop open Wireshark you'll see your typical Ethernet header has a CRC32, your IP header has a CRC32, and your TCP Header has a CRC32, and that's in addition to the what the higher layer protocols may do; e.g. IPSEC might use MD5 or SHA for integrity checking in addition to the above. There are several layers of error checking in typical network communications, and they STILL goof now and again at sub 10mbps speeds.
Cyclic Redundancy Check (CRC) has several common versions and several uncommon but generally is designed to just tell when a message or file has been damaged in transit (multiple bits flipping). CRC32 by itself is not a very good error checking protocol by today's standards in large, scalar enterprise environments because of the collision rate; the average users hard-drive can have upwards of 100k files, and file-shares on a company can have tens of millions. The ratio of hash-space to the number of files is just too low. CRC32 is computationally cheap to implement whereas MD5 isn't.
MD5 was designed to stop intentional use of collisions to make a malicious file look benign. It's considered insecure because the hashspace has been sufficiently mapped to enable some attacks to occur, and some collisions are predictable. SHA1 and SHA2 are the new kids on the block.
For file verification, Md5 is starting to be used by a lot of vendors because you can do multigigabyte files or multiterrabyte files quickly with it and stack that on top of the general OS's use and support of CRC32's. Do not be surprised if within the next decade filesystems start using MD5 for error checking.