3
votes

I'm trying to add file compression to an application. The application has been around for a while, so it needs to be able to read uncompressed documents written by previous versions. I expected that DeflateStream would be able to process an uncompressed file, but it fails with "Found invalid data while decoding", and GZipStream fails with "The magic number in GZip header is not correct". I guess neither of them finds the header that marks the file as its own type.

If it's not possible to simply process an uncompressed file, then 2nd best would be to have a way to determine whether a file is compressed, and choose the method of reading the file. I've found this link: http://blog.somecreativity.com/2008/04/08/how-to-check-if-a-file-is-compressed-in-c/, but this is very implementation specific, and doesn't feel like the right approach. It can also provide false positives (I'm sure this would be rare, but it does indicate that it's not the right approach).

A 3rd option I've considered is to attempt using DeflateStream, and fall back to normal stream IO if an exception occurs. This also feels messy, and causes VS to break at the exception (unless I untick that exception, which I don't really want to have to do).

Of course, I may simply be going about it the wrong way. This is the code I've tried in .NET 3.5:

Stream reader = new FileStream(fileName, FileMode.Open, readOnly ? FileAccess.Read : FileAccess.ReadWrite, readOnly ? FileShare.ReadWrite : FileShare.Read);

using (DeflateStream decompressedStream = new DeflateStream(reader, CompressionMode.Decompress))
{
    workspace = (Workspace)new XmlSerializer(typeof(Workspace)).Deserialize(decompressedStream);

    if (readOnly)
    {
        reader.Close();
        workspace.FilePath = fileName;
    }
    else
        workspace.SetOpen(reader, fileName);
}

Any ideas?

Thanks! Luke.

2
How about using a different file name or extension for your new file format? - Thilo
I guess that would work (it acts as a flag), but I would rather avoid it if possible, so that the end user doesn't have to know about more than one file extension for their documents. I'd also be interested to know if anyone has any other solutions, just to know whether this is a limitation of .NET, or whether I'm doing something wrong. Thanks though! - Luke
"Trying to determine" if the file is compressed (through entropy, checking for non-ASCII characters, whatever) is asking for trouble. You need a proper file header. Compressing your entire file means it's not an XML document anymore (and old versions of your app will barf trying to read it), so there's no reason you can't add a header :) - snemarch

2 Answers

1
votes

Can't you just create a wrapper class/function for reading the file and catch the exception? Something like

try
{
    // Try to read through a decompressing stream and return the result
}
catch (InvalidDataException)
{
    // Not valid compressed data: assume the file is uncompressed and return it as is
}
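To make the idea concrete, here is a minimal sketch of such a wrapper. The names `WorkspaceFile` and `OpenPossiblyCompressed` are made up for illustration, and it assumes the source stream is seekable (a FileStream is) and that the document is small enough to buffer in memory. Note that DeflateStream only throws once you actually read from it, so the whole decode happens inside the try block. An uncompressed file could in principle decode without an error, so treat this as a fallback rather than a reliable format check.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class WorkspaceFile
{
    // Hypothetical helper: returns a stream of plain document bytes,
    // whether or not the source was written through DeflateStream.
    public static Stream OpenPossiblyCompressed(Stream source)
    {
        long start = source.Position;
        try
        {
            MemoryStream plain = new MemoryStream();
            // leaveOpen = true so the caller's stream survives a failed attempt.
            using (DeflateStream inflate = new DeflateStream(source, CompressionMode.Decompress, true))
            {
                byte[] buffer = new byte[4096];
                int read;
                while ((read = inflate.Read(buffer, 0, buffer.Length)) > 0)
                {
                    plain.Write(buffer, 0, read);
                }
            }
            plain.Position = 0;
            return plain;
        }
        catch (InvalidDataException)
        {
            // Not valid deflate data: assume a legacy, uncompressed file.
            source.Position = start;
            return source;
        }
    }
}
```

The returned stream can be handed straight to `XmlSerializer.Deserialize`. The debugger will still break on the first-chance exception if it is configured to do so.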
1
votes

Doesn't your file format have a header? If not, now is the time to add one (you're changing the file format by supporting compression, anyway). Pick a good magic value, make sure the header is extensible (add a version field, or use specific magic values for specific versions), and you're ready to go.

Upon loading, check for the magic value. If not present, use your current legacy loading routines. If present, the header will tell you whether the contents are compressed or not.
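As a rough illustration of that check, here is a sketch of a tiny header with a magic value, a version byte and a compression flag. The layout, the `FileHeader` name and the magic bytes are all assumptions made up for this example; pick a value that cannot collide with the start of the XML files your older versions wrote, and verify it against real files.

```csharp
using System;
using System.IO;

public sealed class FileHeader
{
    // Hypothetical 4-byte magic value (an assumption, not a standard).
    public static readonly byte[] Magic = { 0x00, (byte)'W', (byte)'S', (byte)'P' };

    public byte Version;
    public bool IsCompressed;

    public void WriteTo(Stream target)
    {
        target.Write(Magic, 0, Magic.Length);
        target.WriteByte(Version);
        target.WriteByte(IsCompressed ? (byte)1 : (byte)0);
    }

    // Returns null and rewinds the stream when the magic value is absent,
    // which means the file should go through the legacy loading routine.
    public static FileHeader TryRead(Stream source)
    {
        long start = source.Position;
        byte[] buffer = new byte[Magic.Length + 2];
        int total = 0;
        while (total < buffer.Length)
        {
            // Read may return fewer bytes than requested, so loop until done.
            int read = source.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }

        bool matches = total == buffer.Length;
        for (int i = 0; matches && i < Magic.Length; i++)
        {
            matches = buffer[i] == Magic[i];
        }

        if (!matches)
        {
            source.Position = start;
            return null;
        }

        FileHeader header = new FileHeader();
        header.Version = buffer[Magic.Length];
        header.IsCompressed = buffer[Magic.Length + 1] != 0;
        return header;
    }
}
```

On save, call `WriteTo` before writing the (optionally compressed) XML. On load, a null result from `TryRead` leaves the stream at its original position, ready for the old loader.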

Update

Compressing the stream means the file is no longer an XML document, so there's no reason the file can't contain more than just your data stream. You really do want a header identifying your file :)

The code below is example (pseudo)code. I don't know if .NET has a "substream"; SubRangeStream is likely something you'll have to code yourself. (DeflateStream writes raw deflate data with no header of its own, but deflate data marks its own end, so a substream might not be strictly necessary; it could turn out useful further down the road, though.)

// Remember where we started so the legacy path can rewind.
long oldPosition = reader.Position;
byte[] magic = new byte[MagicLength]; // MagicLength: size of your magic value (placeholder)
int bytesRead = reader.Read(magic, 0, magic.Length);
if (bytesRead == magic.Length && IsRightMagicValue(magic))
{
    Header header = ReadHeader(reader);
    Stream furtherReader = new SubRangeStream(reader, reader.Position, header.ContentLength);
    if (header.IsCompressed)
    {
        furtherReader = new DeflateStream(furtherReader, CompressionMode.Decompress);
    }

    XmlSerializer xml = new XmlSerializer(typeof(Workspace));
    workspace = (Workspace)xml.Deserialize(furtherReader);
}
else
{
    // No magic value: this is an old, uncompressed file.
    reader.Position = oldPosition;
    LegacyLoad(reader);
}
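Since SubRangeStream is something you would have to write yourself, here is one possible minimal version. It is only a sketch under a few assumptions: the base stream is seekable, the view is read-only and forward-only, and it does not take ownership of (or dispose) the base stream.

```csharp
using System;
using System.IO;

// Read-only, forward-only view over `length` bytes of a seekable base stream,
// beginning at absolute position `start`.
public sealed class SubRangeStream : Stream
{
    private readonly Stream baseStream;
    private readonly long start;
    private readonly long length;
    private long consumed; // bytes already handed out from the window

    public SubRangeStream(Stream baseStream, long start, long length)
    {
        if (baseStream == null) throw new ArgumentNullException("baseStream");
        if (!baseStream.CanSeek) throw new ArgumentException("Base stream must be seekable.", "baseStream");
        this.baseStream = baseStream;
        this.start = start;
        this.length = length;
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return length; } }

    public override long Position
    {
        get { return consumed; }
        set { throw new NotSupportedException(); }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        long remaining = length - consumed;
        if (remaining <= 0) return 0; // end of the window
        if (count > remaining) count = (int)remaining;

        // Reposition every time in case someone else moved the base stream.
        baseStream.Position = start + consumed;
        int read = baseStream.Read(buffer, offset, count);
        consumed += read;
        return read;
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
```

Because the window reports end-of-stream after `length` bytes, anything stored after the content (a footer, a second section) stays invisible to the deserializer.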

In real life, I would do things a bit differently, with some proper error handling and cleanup, for instance. Also, I wouldn't have the new loader code directly in the IsRightMagicValue block. Instead, I'd either spin off the work based on the magic value (one magic value per file version), or keep a "common header" portion with fields common to all versions. In both cases, I'd use a Factory Method to return an IWorkspaceReader depending on the file version.
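A factory along those lines might look roughly like this. Everything here is illustrative: the `Workspace` class is a stand-in for the application's real type, the version numbers 0 and 1 are invented, and the reader class names are assumptions.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Serialization;

// Stand-in for the application's real Workspace type.
public class Workspace
{
    public string Name;
}

public interface IWorkspaceReader
{
    Workspace Read(Stream content);
}

// Hypothetical version 0: plain XML, as written by older releases.
public sealed class PlainXmlWorkspaceReader : IWorkspaceReader
{
    public Workspace Read(Stream content)
    {
        return (Workspace)new XmlSerializer(typeof(Workspace)).Deserialize(content);
    }
}

// Hypothetical version 1: the same XML wrapped in a deflate stream.
public sealed class DeflateXmlWorkspaceReader : IWorkspaceReader
{
    public Workspace Read(Stream content)
    {
        // leaveOpen = true: the caller still owns the underlying stream.
        using (DeflateStream inflate = new DeflateStream(content, CompressionMode.Decompress, true))
        {
            return (Workspace)new XmlSerializer(typeof(Workspace)).Deserialize(inflate);
        }
    }
}

public static class WorkspaceReaderFactory
{
    // Factory Method: map the file version found in the header to a reader.
    public static IWorkspaceReader Create(int fileVersion)
    {
        switch (fileVersion)
        {
            case 0: return new PlainXmlWorkspaceReader();
            case 1: return new DeflateXmlWorkspaceReader();
            default:
                throw new NotSupportedException("Unknown file version: " + fileVersion);
        }
    }
}
```

The loading code then reduces to reading the header, calling `WorkspaceReaderFactory.Create(version)` and passing the remaining stream to `Read`. Supporting a future format means adding one reader class and one case.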