GZipStream doesn't detect corrupt data (even CRC32 passes)?

Question

I'm using GZipStream to compress / decompress data. I chose this over DeflateStream since the documentation states that GZipStream also adds a CRC to detect corrupt data, which is another feature I wanted. My "positive" unit tests are working well in that I can compress some data, save the compressed byte array and then successfully decompress it again. The .NET GZipStream compress and decompress problem post helped me realize that I needed to close the GZipStream before accessing the compressed or decompressed data.

Next, I continued to write a "negative" unit test to be sure corrupt data could be detected. I had previously used the example for the GZipStream class from MSDN to compress a file, open the compressed file with a text editor, change a byte to corrupt it (as if opening it with a text editor wasn't bad enough!), save it and then decompress it to be sure that I got an InvalidDataException as expected.

When I wrote the unit test, I picked an arbitrary byte to corrupt (e.g., compressedDataBytes[50] = 0x99) and got an InvalidDataException. So far so good. I was curious, so I chose another byte, but to my surprise I did not get an exception. This may be okay (e.g., I coincidentally hit an unused byte in a block of data), so long as the data could still be recovered successfully. However, I didn't get the correct data back either!

To be sure "it wasn't me", I took the cleaned up code from the bottom of .NET GZipStream compress and decompress problem and modified it to sequentially corrupt each byte of the compressed data until it failed to decompress properly. Here's the changes (note that I'm using the Visual Studio 2010 test framework):

// successful compress / decompress example code from:
//    https://stackguides.com/questions/1590846/net-gzipstream-compress-and-decompress-problem
[TestMethod]
public void Test_zipping_with_memorystream_and_corrupting_compressed_data()
{
   const string sample = "This is a compression test of microsoft .net gzip compression method and decompression methods";
   var encoding = new ASCIIEncoding();
   var data = encoding.GetBytes(sample);
   string sampleOut = null;
   byte[] cmpData;

   // Compress 
   using (var cmpStream = new MemoryStream())
   {
      using (var hgs = new GZipStream(cmpStream, CompressionMode.Compress))
      {
         hgs.Write(data, 0, data.Length);
      }
      cmpData = cmpStream.ToArray();
   }

   int corruptBytesNotDetected = 0;

   // corrupt data byte by byte
   for (var byteToCorrupt = 0; byteToCorrupt < cmpData.Length; byteToCorrupt++)
   {
      // corrupt the data
      cmpData[byteToCorrupt]++;

      using (var decomStream = new MemoryStream(cmpData))
      {
         using (var hgs = new GZipStream(decomStream, CompressionMode.Decompress))
         {
            using (var reader = new StreamReader(hgs))
            {
               try
               {
                  sampleOut = reader.ReadToEnd();

                  // if we get here, the corrupt data was not detected by GZipStream
                  // ... okay so long as the correct data is extracted
                  corruptBytesNotDetected++;

                  var message = string.Format("ByteCorrupted = {0}, CorruptBytesNotDetected = {1}",
                     byteToCorrupt, corruptBytesNotDetected);

                  Assert.IsNotNull(sampleOut, message);
                  Assert.AreEqual(sample, sampleOut, message);
               }
               catch(InvalidDataException)
               {
                  // data was corrupted, so we expect to get here
               }
            }
         }
      }

      // restore the data
      cmpData[byteToCorrupt]--;
   }
}

When I run this test, I get:

Assert.AreEqual failed. Expected:<This is a compression test of microsoft .net gzip compression method and decompression methods>. Actual:<>. ByteCorrupted = 11, CorruptBytesNotDetected = 8

So, this means there were actually 7 cases where corrupting the data made no difference (the string was successfully recovered), but corrupting byte 11 neither threw an exception, nor recovered the data.

Am I missing something or doing soemthing wrong? Can anyone see why the corrupt compressed data is not being detected?

Does GZip have a CRC? (There might be many "corrupt bytes" that can't be detected while decompressing, as they might still result in a "valid" or "not currently yet invalid" stream.) — user166390
The doc's say it does have a CRC. Nonetheless, some bytes can be corrupted and the data is recovered successfully (which is okay... the 7 cases noted above). However, if it doesn't throw an exception and it doesn't recover the data properly, then I think this is a problem (which is where the iterations in the test stop). — Andrew Teare
Check the exceptions that are thrown: it's not always InvalidDataException — user166390
However, I can't explain why the CRC32 doesn't cause an exception to be thrown in all cases (it should be much more reliable than it seems) even though it does detect CRC errors in, say, a measly 60-80% of the invalid cases "over 11" that don't cause a Huffman lookup exception or other internal decoding error. I was under the impression that CRC32 should result in detection rates of over 99.999% (and then some), which does not seem to be the case here :- — user166390
I'm only catching InvalidDataException, so any other would cause the test to fail because of that exception. However, the test only fails because no exception was thrown and the decompressed data is not as expected. Thanks for the idea though. I still feel like I'm missing something, but the test seems pretty straight forward. — Andrew Teare

Mark Adler Mark Adler · Accepted Answer · 2012-02-27T05:17:35

There is a 10-byte header in the gzip format, for which the last seven bytes can be changed without resulting in a decompression error. So the seven cases you noted with no corruption are expected.

It should be vanishingly rare to not detect an error with a corruption anywhere else in the stream. Most of the time the decompressor will detect an error in the format of the compressed data, never even getting to the point of checking the crc. If it does get to the point of checking a crc, that check should fail very nearly all the time with a corrupted input stream. ("Nearly all the time" means a probability of about 1 - 2^-32.)

I just tried it (in C with zlib) using your sample string, which produces an 84-byte gzip stream. Incrementing each of the 84 bytes leaving the remainder the same, as you did, resulted in: two incorrect header checks, one invalid compression method, seven successes, one invalid block type, four invalid distances set, seven invalid code lengths set, four missing end-of-block, 11 invalid bit length repeat, three invalid bit length repeat, two invalid bit length repeat, two unexpected end of stream, 36 incorrect data check (that's the actual CRC error), and four incorrect length check (another check in the gzip format for the correct uncompressed data length). In no cases was a corrupted compressed stream not detected.

So there must be a bug somewhere, either in your code or in the class.

Update:

It appears that there is a bug in the class.

Remarkably (or maybe not remarkably), Microsoft has concluded that they won't fix this bug!

GZipStream doesn't detect corrupt data (even CRC32 passes)?

1 Answers