3 votes

Note: I'm not trying to use SuperFastHash expecting it to give the same output values as CRC32.

I'm writing a simple LZSS compression/decompression routine to provide very fast decompression with no memory overhead during decompression. The input data is split into 4096-byte blocks, which are compressed sequentially.

My problem: I want to add some error detection to each compressed block (block size <= 4096 bytes). The time constraint is drastic, so the checksum routine must be very fast. I ruled out cryptographic algorithms (MD5, SHA-1) because they involve a lot of computation, and chose CRC32 (a simpler, more obvious algorithm).
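Concretely, the per-block layout could look something like this (just a sketch; the field names are illustrative, not part of my actual format):

#include <stdint.h>

/* Hypothetical per-block framing: each compressed block is stored with its
   compressed length and a checksum computed over the block. */
typedef struct
{
    uint16_t compressed_len;   /* <= 4096 */
    uint32_t checksum;         /* CRC32, SuperFastHash, ... */
    /* followed by compressed_len bytes of LZSS data */
} block_header_t;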

After some tests, I found CRC32 too slow for my project's constraints. I used enwik9 (a 10^9-byte text dump of Wikipedia) from here. I compressed it with my LZSS routine and obtained a 570 MB file. I measured the following durations (single-threaded, disk I/O excluded, all data loaded into memory before processing, average of 10 trials):

|          Operation            |  Time (GCC4.4.5/Linux)   |  Time (MSVC2010/Win7)  |
|-------------------------------+--------------------------+------------------------|
|        Decompression          |        6.8 seconds       |      6.95 seconds      |
|  CRC32 on decompressed result |        4.9 seconds       |      4.62 seconds      |
|   CRC32 on compressed result  |        2.8 seconds       |      2.69 seconds      |

Then I tested SuperFastHash, just out of curiosity:

|          Operation            |  Time (GCC4.4.5/Linux)   |  Time (MSVC2010/Win7)  |
|-------------------------------+--------------------------+------------------------|
|  SFH on decompressed result   |        1.1 seconds       |      1.33 seconds      |
|   SFH on compressed result    |        0.7 seconds       |      0.75 seconds      |

And here is my CRC32 implementation (I followed the description in the following document: http://www.ross.net/crc/download/crc_v3.txt):

#include <stddef.h>   /* size_t */
#include <stdint.h>

// CRC32 lookup table (for the polynomial 0x04C11DB7, used here in its reflected form 0xEDB88320)
static const uint32_t  crc32_lookup_table[256] =
{
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
    0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
    0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988,
    // many lines skipped
    // ...
    0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D
} ;

uint32_t crc32_hash(const uint8_t * data, size_t len)
{
    uint32_t crc32_register = 0xFFFFFFFF ;
    while( len-- )
    {
        crc32_register = (crc32_register >> 8)
                       ^ crc32_lookup_table[(crc32_register & 0x000000FF) ^ *data++] ;
    }
    return crc32_register ^ 0xFFFFFFFF ;
}
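For completeness, the full 256-entry table can also be generated at startup instead of being hard-coded; the following sketch produces the same values as the excerpt above (0xEDB88320 being the reflected form of 0x04C11DB7):

static uint32_t crc32_generated_table[256];

static void crc32_init_table(void)
{
    for (uint32_t i = 0; i < 256; ++i)
    {
        uint32_t c = i;
        for (int k = 0; k < 8; ++k)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320 : (c >> 1);
        crc32_generated_table[i] = c;   /* e.g. entry 1 == 0x77073096 */
    }
}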

My question is:

Can I use a hash instead of a cyclic redundancy check value to perform error detection on compressed data blocks? As far as I know (and as I remember from my electronics courses), CRC algorithms are designed to be very efficient when errors occur in bursts, as when data is transmitted over a noisy channel, which is not the case for data read from hard drives. Please correct me if I'm wrong.

Thanks for any advice!

If you're only guarding against accidental corruption of the compressed file, and checking the checksum on the decompressed data, it is likely that just about any checksum method will work well, since any error in the compressed data would tend to corrupt everything that follows it. Perhaps with the exception of a simple modular sum or XOR of everything, but even something like Adler-32 ought to do the job (see the sketch after these comments). – hmakholm left over Monica
Thanks! I thought the same thing, but placing the error-detection step after decompression requires very good error handling in the decompression routine when it is fed erroneous data. By error, I mean the memory underflows/overflows that can occur when dealing with invalid LZ pos/len values in the compressed block. If only I could get rid of the boundary tests in the main loop of the decompression routine, I could easily gain 15% more speed... – overcoder
You'll need to handle memory overflows in any case, for security reasons. Otherwise it would be easy for an attacker to craft input that crashes (or possibly subverts) your program -- and any checksum that a legitimate compressor could produce for the compressed data could also be mimicked by hand by someone crafting an attack packet. – hmakholm left over Monica
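For reference, here is a minimal sketch of the Adler-32 checksum mentioned in the first comment (the zlib algorithm, written naively; production implementations defer the modulo over larger runs):

#include <stdint.h>
#include <stddef.h>

/* Naive Adler-32: two running sums modulo 65521 (the largest prime below 2^16),
   combined into a single 32-bit value. */
uint32_t adler32_hash(const uint8_t * data, size_t len)
{
    const uint32_t MOD_ADLER = 65521;
    uint32_t a = 1, b = 0;

    while (len--)
    {
        a = (a + *data++) % MOD_ADLER;   /* running byte sum */
        b = (b + a) % MOD_ADLER;         /* sum of the running sums */
    }
    return (b << 16) | a;
}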

3 Answers

1 vote

Since your problem is not about security, you can use "broken" cryptographic hash functions, which are not secure against a sentient attacker but are still very good at detecting transmission errors. I am thinking of MD4, which has been measured to be faster than CRC32 on some platforms. You may also want to look at RadioGatún and Panama; see this library for open-source implementations in C and Java of various cryptographic hash functions.

If your target architecture is a recent enough x86 CPU featuring the AES-NI instructions, then you could make a devilishly fast and very good checksum by simply computing a CBC-MAC with the AES block cipher and a conventional key (e.g. an all-zero key); since this is not for security, you could even use fewer rounds than standard AES (e.g. 5 rounds instead of the standard 10).
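A rough sketch of that idea with AES-NI intrinsics might look as follows. This is illustrative only, not a definitive implementation: the round keys are arbitrary constants, only 5 rounds are run per block, and the final fold from 128 bits down to 32 bits is my own choice (compile with -maes on GCC):

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <wmmintrin.h>   /* AES-NI intrinsics */

uint32_t aes_cbcmac_checksum(const uint8_t * data, size_t len)
{
    __m128i rk[5];
    for (int r = 0; r < 5; ++r)   /* fixed, arbitrary round "keys" (fine for error detection) */
        rk[r] = _mm_set1_epi32((int)(0x9E3779B9u * (uint32_t)(r + 1)));

    __m128i mac = _mm_setzero_si128();
    while (len > 0)
    {
        uint8_t buf[16] = { 0 };            /* zero-pad the final block */
        size_t n = len < 16 ? len : 16;
        memcpy(buf, data, n);
        data += n;
        len  -= n;

        mac = _mm_xor_si128(mac, _mm_loadu_si128((const __m128i *)buf));   /* CBC chaining */
        for (int r = 0; r < 5; ++r)         /* 5 AES rounds per 16-byte block */
            mac = _mm_aesenc_si128(mac, rk[r]);
    }

    uint32_t out[4];                        /* fold the 128-bit state down to 32 bits */
    _mm_storeu_si128((__m128i *)out, mac);
    return out[0] ^ out[1] ^ out[2] ^ out[3];
}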

3 votes

SuperFastHash has been found to have some problems, as has its fellow fast hash function MurmurHash2. If you're looking for something tuned for bigger data blocks with low collisions, you can try the 128-bit variants of Google's CityHash (http://code.google.com/p/cityhash/) or MurmurHash3. There are also some more off-the-wall ones like Crap8 and CrapWow, which claim to provide almost perfect bit avalanching and funneling and thus almost zero collisions; you can read about them and other non-cryptographic hash functions here: http://www.team5150.com/~andrew/noncryptohashzoo/
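As an example of how one of these would be used per block, here is a hedged sketch against the 128-bit x64 variant of MurmurHash3, assuming the MurmurHash3.h header from the SMHasher reference implementation is available in the build; the function and parameter names below are mine:

#include <stdint.h>
#include "MurmurHash3.h"   /* SMHasher reference implementation (assumed available) */

/* Checksum one compressed block: the 16-byte digest would be stored alongside
   the block and recomputed/compared on load. */
void checksum_block(const uint8_t * block, int block_len, uint8_t digest[16])
{
    const uint32_t seed = 0;   /* any fixed seed, as long as verification uses the same one */
    MurmurHash3_x64_128(block, block_len, seed, digest);
}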

1 vote

Hashes are designed to produce a large change in the output even for very small alterations of the input.

I think SuperFastHash has this property. It might be somewhat more vulnerable to collisions (since it seems to have been less analyzed by the community), but that shouldn't prevent the use you have in mind.

Good luck :)