35
votes

I barely know a thing about compression, so bear with me (this is probably a stupid and painfully obvious question).

So let's say I have an XML file with a few tags.

<verylongtagnumberone>
  <verylongtagnumbertwo>
    text
  </verylongtagnumbertwo>
</verylongtagnumberone>

Now let's say I have a bunch of these very long tags with many attributes across my multiple XML files. I need to compress them to the smallest size possible. The best way would be to use an XML-specific algorithm that assigns individual tags pseudonyms like vlt1 or vlt2. However, that wouldn't be as 'open' an approach as I'm trying to go for, so I want to use a common algorithm like DEFLATE or LZ. It also helps if the archive is a .zip file.

Since I'm dealing with plain text (no binary files like images), I'd like an algorithm that suits plain text. Which one produces the smallest file size (lossless algorithms are preferred)?

By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.
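
For illustration, here's a minimal sketch of the kind of packaging I have in mind, using Python's standard zipfile module with plain DEFLATE (the part names are made up):

import zipfile

# Package several XML parts into one .zip container, the way ODF and
# MS Office XML do, using standard DEFLATE at maximum compression.
with zipfile.ZipFile("document.mypkg", "w",
                     compression=zipfile.ZIP_DEFLATED,
                     compresslevel=9) as archive:
    for part in ("content.xml", "styles.xml", "meta.xml"):
        archive.write(part)  # each part is an XML file on disk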

EDIT: The 'encryption' thing was a typo; it should have been 'compression'.

8
How is this related to encryption? And the simple answer is to let ZIP do the compression: it's widely available, does a decent job on text, and it's not worth the time to find "the smallest size possible." – kdgregory
Why not just use OpenXML? It's basically what you want :). Not sure if it's the best compression, but I'm liking it so far. And if you don't know it already, OpenXML is basically a zip file, so you can rename your Office 2007 documents as a .zip file (i.e. something.docx to something.zip) and open it as a zip file. Inside is basically a bunch of XML files. – Jimmy Chandra
You could simply use a bunch of XML files in a zip file, and whatever file extension you want. Why the verylongtagnumbers??? – Osama Al-Maadeed
'ave' and 'ben' are typos; 'encryption' instead of 'compression' is a mistake. – MrFox

8 Answers

34
votes

There is a W3C (not-yet-released) standard named EXI (Efficient XML Interchange).

It should become THE data format for compressing XML data in the future (claimed to be the last necessary binary format). Being optimized for XML, it compresses XML far more efficiently than any conventional compression algorithm.

With EXI, you can operate on compressed XML data on the fly (without the need to uncompress or re-compress it).

EXI = (XML + XMLSchema) as binary.

And here is the open-source implementation (I don't know if it's already stable):
Exificient

6
votes

Another alternative to "compress" XML would be FI (Fast Infoset).

XML stored as FI contains every tag and attribute name only once; all other occurrences reference the first one, thus saving space.

See:

  • Very good article on java.sun.com, and of course
  • the Wikipedia entry

The difference from EXI, from a compression point of view, is that Fast Infoset (being structured plaintext) is less efficient.

Another important difference: FI is a mature standard with many implementations.
One of them is the Fast Infoset Project @ dev.java.net.
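
This is not the actual FI wire format (that is defined in ITU-T X.891), just a toy Python sketch of the underlying idea: keep a vocabulary of tag names, store each name once, and turn repeats into small integer references.

def tokenize_tags(tag_names):
    # Toy illustration of the Fast Infoset idea only: a vocabulary table
    # so every distinct tag name is stored once and later occurrences
    # become small indices into that table.
    vocabulary = []   # first occurrence of each distinct name
    tokens = []       # stream of indices into the vocabulary
    for name in tag_names:
        if name not in vocabulary:
            vocabulary.append(name)
        tokens.append(vocabulary.index(name))
    return vocabulary, tokens

vocab, tokens = tokenize_tags(
    ["verylongtagnumberone", "verylongtagnumbertwo",
     "verylongtagnumbertwo", "verylongtagnumberone"])
print(vocab)   # ['verylongtagnumberone', 'verylongtagnumbertwo']
print(tokens)  # [0, 1, 1, 0]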

6
votes

Yes, *.zip is best in practice. The gory details are in this USENIX paper, which shows that "optimal" compressors aren't worth the computational cost and that domain-specific compressors don't beat zip [on average].

Disclaimer: I wrote that paper, which has been cited 60+ times according to Google.

2
votes

It seems like you're more interested in compression rather than encryption. Is that the case? If so, this might prove an interesting read even though it is not an exact solution.

1
votes

By the way, the scenario is this: I am creating a standard for documents, like ODF or MS Office XML, that contain XML files, packaged in a .zip.

Then I'd suggest you use .zip compression, or your users will get confused.

0
votes

Your alternatives are:

  • Use a web server that supports gzip compression. It'll automatically compress all outgoing HTML. There's a small CPU penalty, though (see the sketch below).
  • Use something like JSON. It'll drastically reduce the size of the message.
  • There's also binary XML, but I have not tried it myself.
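
A rough sketch of what the server-side gzip step boils down to, using Python's standard gzip module (the payload here is made up):

import gzip

# What a web server's gzip filter effectively does to an outgoing
# text response body before sending it with "Content-Encoding: gzip".
payload = (b"<verylongtagnumberone><verylongtagnumbertwo>text"
           b"</verylongtagnumbertwo></verylongtagnumberone>") * 100
compressed = gzip.compress(payload, compresslevel=6)  # a mid-range level
print(len(payload), "->", len(compressed), "bytes")
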
0
votes

I hope I understood correctly what you need to do...

First, there are no good or bad compression algorithms for text: zip, bzip, gzip, rar, and 7zip are all good enough to compress anything that has low entropy, i.e. a large file with a small character set. If I had to pick, I would choose 7zip first, rar second, and zip third, but the difference is very small, so you should try whatever is easiest for you.

Second, I could not understand what you are trying to encrypt. Assuming it is an XML file, you should first compress it using your favourite compression algorithm and then encrypt it using your favourite encryption algorithm. In most cases any modern algorithm, for instance as implemented in PGP, will be secure enough for anything.

Hope that helps.
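
If you want to see how small the difference is on your own data, here's a quick sketch comparing the three families from the Python standard library (zlib/DEFLATE as used by zip, bz2, and LZMA as used by 7zip) on a repetitive XML sample:

import bz2, lzma, zlib

# A repetitive XML sample, roughly the kind of input described in the question.
sample = (b"<verylongtagnumberone attr='value'>"
          b"<verylongtagnumbertwo>text</verylongtagnumbertwo>"
          b"</verylongtagnumberone>\n") * 1000

for name, compress in (("zlib/DEFLATE (zip)", lambda d: zlib.compress(d, 9)),
                       ("bz2", lambda d: bz2.compress(d, 9)),
                       ("lzma (7zip family)", lzma.compress)):
    print(name, ":", len(sample), "->", len(compress(sample)), "bytes")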

0
votes

None of the default algorithms is ideal for XML, but you will still get good results since there is a lot of repetition.

Because XML uses a lot of repeated symbols (tags, '<', '>'), you want these to cost less than a bit each, which calls for some form of arithmetic rather than Huffman encoding. So rar/7zip should be significantly better in theory; these algorithms offer higher compression but are slower. Ideally you'd want a simple compressor with an arithmetic encoder, which for XML would be fast and give high compression.
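
Here's a small sketch of why sub-bit symbols matter (the frequencies are made up): for a heavily skewed symbol distribution the Shannon entropy falls below one bit per symbol, which an arithmetic coder can approach but Huffman coding, which spends at least one whole bit on every symbol, cannot.

import math

# Made-up symbol frequencies for a heavily skewed alphabet: one symbol
# (think of a repeated tag token) dominates the stream.
frequencies = {"tag": 0.95, "<": 0.02, ">": 0.02, "text": 0.01}

# Shannon entropy: the lower bound, in bits per symbol, that an
# arithmetic coder can approach.
entropy = -sum(p * math.log2(p) for p in frequencies.values())

print("entropy        : %.3f bits/symbol" % entropy)   # about 0.36
print("Huffman minimum: 1.000 bits/symbol")            # one whole bit each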