Note : I'm not trying to use SuperFastHash and expecting it to give the same output values as CRC32.
I'm writing a simple LZSS compression/decompression routine designed for very fast decompression with no memory overhead. The input data is split into 4096-byte blocks and compressed sequentially.
My problem : I want to add some error detection to each compressed block (block size <= 4096 bytes). The time constraint is tight, so the checksum routine must be very fast. I ruled out cryptographic algorithms (MD5, SHA-1) because of their computational cost, and chose CRC32 (a simpler and obvious choice).
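To give an idea of how the checksum would be used, here is a rough sketch of the per-block verification step (the header layout and the function names below are only illustrative, not my actual on-disk format) :

#include <stdint.h>
#include <stddef.h>

// Illustrative per-block header : it only shows where the checksum would be
// stored and checked, the real layout may differ.
typedef struct
{
    uint16_t compressed_size;   // <= 4096
    uint32_t checksum;          // CRC32 (or hash) of the compressed payload
} block_header_t;

uint32_t crc32_hash(const uint8_t * data, size_t len);   // shown further below

// Returns 1 if the block passes the check and can be decompressed, 0 otherwise
int block_is_valid(const block_header_t * header, const uint8_t * payload)
{
    return crc32_hash(payload, header->compressed_size) == header->checksum;
}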
After making some tests, I found CRC32 too slow for my project's constraints. I used enwik9 (a 10^9-byte text dump of Wikipedia) from here. I compressed it with my LZSS routine and obtained a 570 MB file. I measured the following durations (single-threaded, disk I/O excluded, all data loaded in memory before processing, average of 10 trials) :
| Operation                    | Time (GCC 4.4.5/Linux) | Time (MSVC 2010/Win7) |
|------------------------------|------------------------|-----------------------|
| Decompression                | 6.8 seconds            | 6.95 seconds          |
| CRC32 on decompressed result | 4.9 seconds            | 4.62 seconds          |
| CRC32 on compressed result   | 2.8 seconds            | 2.69 seconds          |
Then, out of curiosity, I tested SuperFastHash :
| Operation                  | Time (GCC 4.4.5/Linux) | Time (MSVC 2010/Win7) |
|----------------------------|------------------------|-----------------------|
| SFH on decompressed result | 1.1 seconds            | 1.33 seconds          |
| SFH on compressed result   | 0.7 seconds            | 0.75 seconds          |
And here is my CRC32 implementation (I followed the description in this document : http://www.ross.net/crc/download/crc_v3.txt) :
#include <stdint.h>
#include <stddef.h>   // for size_t

// CRC32 lookup table (corresponding to the polynomial 0x04C11DB7, reflected form)
static const uint32_t crc32_lookup_table[256] =
{
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
    0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
    0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988,
    // many lines skipped
    // ...
    0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D
};

uint32_t crc32_hash(const uint8_t * data, size_t len)
{
    uint32_t crc32_register = 0xFFFFFFFF;
    while (len--)
    {
        // Consume one input byte per iteration (table-driven, byte-at-a-time)
        crc32_register = (crc32_register >> 8)
                       ^ crc32_lookup_table[(crc32_register & 0x000000FF) ^ *data++];
    }
    return crc32_register ^ 0xFFFFFFFF;
}
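For reference, the skipped table entries are the standard ones for the reflected (LSB-first) form of that polynomial, i.e. 0xEDB88320, so they can be regenerated with a small loop like the one below (shown only to make the skipped lines reproducible, it is not part of the timed routine) :

#include <stdint.h>

// Fills a 256-entry lookup table for the reflected CRC32 polynomial 0xEDB88320
// (the bit-reversed form of 0x04C11DB7). The first values it produces match
// the ones listed above (0x00000000, 0x77073096, 0xEE0E612C, ...).
static void crc32_build_table(uint32_t table[256])
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t crc = i;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (0xEDB88320 ^ (crc >> 1)) : (crc >> 1);
        table[i] = crc;
    }
}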
My question is :
Can I use a hash instead of a cyclic redundancy check to perform error detection on compressed data blocks ? As far as I know (and as I remember from my electronics courses), CRC algorithms are designed to be very effective at detecting burst errors, the kind of errors that occur when data is transmitted over a noisy channel, which is not really the error model for data read from a hard drive. Please correct me if I'm wrong.
Thanks for any advice !