Is it possible to 'prime' a zlib compressor (or some other open source compression engine) by feeding it a set of common strings, to increase the efficiency of compressing large numbers of very similar text packets one by one?
I'm trying to improve my scheme for logging millions of XML packets which are not only highly redundant in themselves but also very similar to one another. Usually fewer than one percent of the bytes change from one message to the next. However, one of the objectives of the logging is troubleshooting half-baked client apps, which is why I can't simply normalise the messages or extract only the salient information: the messages must be logged exactly as they came over the wire, byte for byte.
Currently the only way of exploiting redundancy between messages is to bundle a good number of them together into a single compressed packet, say 100 or 1000, or a whole day's worth. However, that would make the logging logic way too complicated for my taste and much less robust, not to mention the difficulties arising from concurrent processes and random access to specific messages.
Which is why I thought that I could take some stream compressor and feed it a bunch of common strings P to get compressed text ZP, then work out the stable prefix by feeding it P + message[i] for some i and comparing the compressed results to ZP. What goes into the database would be the compressed text without the common prefix, and the known common prefix would then be re-added before decompressing. After decompressing I would take the part after the common prefix P, obviously.
Some tests indicate that the gain in compression ratio would be one or two orders of magnitude for smaller messages, but unfortunately such a trick doesn't work with the zlib deflate method...
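To make the prefix experiment concrete, here is a rough Python sketch of how the stable prefix could be measured. The names `common_prefix_len`, `stable_prefix_len`, `P` and `MESSAGES` are made up for illustration, the sample data is invented, and `zlib.compress` merely stands in for whichever stream compressor is being tried.

```python
import zlib


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stable_prefix_len(p: bytes, messages) -> int:
    """Length of the part of ZP = compress(P) that also starts every
    compress(P + message). Only this many bytes could be stripped before
    storing and re-added before decompressing."""
    zp = zlib.compress(p)
    return min(common_prefix_len(zp, zlib.compress(p + m)) for m in messages)


# Hypothetical stand-ins for the common strings P and for a few messages.
P = (
    b'<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">'
    b'<SOAP-ENV:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
)
MESSAGES = [P + b"<woeM>1</woeM>", P + b"<woeM>2</woeM>"]

if __name__ == "__main__":
    zp = zlib.compress(P)
    print("len(ZP) =", len(zp))
    print("stable prefix =", stable_prefix_len(P, MESSAGES))
```

Comparing the reported stable prefix length with `len(ZP)` shows how much of the compressed common part survives unchanged once a message is appended.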
Are there other ways of getting similar improvements (storage requirements slashed by orders of magnitude) without the hassle of the message bundling method mentioned above? Ideally the interface ought to be just foo_deflate(text) and foo_inflate(compressed_text), with all the trickery hidden inside the implementation of those two functions. I'm not afraid of whipping out a compiler and getting dirty, but all of the complexity must be confined to the compression module. In other words, the only acceptable interface change is the name change for the deflate/inflate functions. The bundling method does not meet that requirement and adds a bunch of further complications.
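To illustrate the shape of the interface I'm after, here is a minimal Python sketch. It assumes zlib's preset-dictionary feature (the `zdict` argument of Python's `zlib.compressobj` and `zlib.decompressobj`) as one possible priming mechanism. Whether it delivers the hoped-for gains on the real messages is not something this sketch shows, and `COMMON` is an invented dictionary, not taken from the actual traffic.

```python
import zlib

# Hypothetical dictionary: strings expected to occur in nearly every message.
# Both sides must use exactly the same bytes.
COMMON = (
    b'<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">'
    b'<SOAP-ENV:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    b'xmlns:xsd="http://www.w3.org/2001/XMLSchema" >'
    b'<?xml version="1.0" encoding="iso-8859-15"?>'
)


def foo_deflate(text: bytes) -> bytes:
    """Compress one message on its own, primed with the shared dictionary."""
    compressor = zlib.compressobj(level=9, zdict=COMMON)
    return compressor.compress(text) + compressor.flush()


def foo_inflate(compressed_text: bytes) -> bytes:
    """Restore the original message bytes using the same dictionary."""
    decompressor = zlib.decompressobj(zdict=COMMON)
    return decompressor.decompress(compressed_text) + decompressor.flush()


if __name__ == "__main__":
    message = COMMON + b"<woeM>42</woeM>"
    packed = foo_deflate(message)
    assert foo_inflate(packed) == message  # byte-for-byte round trip
    print(len(message), "->", len(packed), "bytes")
```

Each message is still compressed and decompressed independently, so concurrent writers and random access to individual messages are unaffected. The catch is that the dictionary has to stay fixed (or be versioned) for as long as old records must remain readable.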
Here's an example of what the messages look like, reformatted for readability and slightly hacked to protect the guilty:
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema" >
<foobarMeowMeow xmlns="http://bungle-and-botch.com/spec/abrechnungsservice/types">
<foobarMeowHiss xmlns="">
<?xml version="1.0" encoding="iso-8859-15"?>
<foobarMeowHiss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns="http://bungle-and-botch.com/spec/abrechnungsservice">
<woeM>
...
</woeM>
<foobarMeowHiss;>
</foobarMeowHiss>
<foobarHissMeow>
<?xml version="1.0" encoding="iso-8859-15"?>
<foobarHissMeow xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns="http://bungle-and-botch.com/spec/abrechnungsservice">
<jbrZ;>
...