I am making a packer (run-time compressor) to study the Windows PE file format. I know some data compression algorithms such as RLE, LZW, Huffman encoding, etc., but which algorithm is best for compressing binary data such as .exe files? Can anyone suggest one?
1 Answer
For a start you should begin with an LZ77 or LZ78 algorithm, which offer a quite good compression ratio and a small decompression stub (obviously, having a small decompression stub is a must-have for a packer).
Following the LZ7x algorithms is the LZMA algorithm, which (usually) offers better compression than the LZ7x algorithms.
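To make the LZ77 family concrete, below is a minimal decompression loop in C. The token format is made up for illustration (a flag byte, then either a literal byte or a 16-bit offset plus an 8-bit length); real schemes such as LZSS or LZ4 pack flags into bitmaps, but the loop shape is the same.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal LZ77-style decompression sketch (hypothetical token format):
 * each token starts with a flag byte: 0 = a literal byte follows,
 * anything else = a copy, followed by a 2-byte little-endian offset
 * and a 1-byte length. */
size_t lz77_decompress(const uint8_t *src, size_t src_len,
                       uint8_t *dst, size_t dst_cap)
{
    size_t in = 0, out = 0;

    while (in < src_len) {
        uint8_t flag = src[in++];
        if (flag == 0) {                      /* literal byte */
            if (in >= src_len || out >= dst_cap) break;
            dst[out++] = src[in++];
        } else {                              /* back-reference */
            if (in + 3 > src_len) break;
            uint16_t offset = (uint16_t)(src[in] | (src[in + 1] << 8));
            uint8_t  length = src[in + 2];
            in += 3;
            if (offset == 0 || offset > out) break;   /* corrupt stream */
            for (uint8_t i = 0; i < length && out < dst_cap; i++) {
                dst[out] = dst[out - offset]; /* may overlap: re-copies
                                                 recently written bytes */
                out++;
            }
        }
    }
    return out;   /* number of bytes produced */
}
```

Note that the whole decoder fits in a few dozen instructions with no tables to embed, which is exactly why LZ77-style schemes yield such small stubs.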
If you have never written a packer before, I suggest that you write the decompression stub mostly in a low-level language (C is the de facto language for that) in PIC (Position Independent Code) style, and only small parts in assembly language where needed. This has the advantage of letting the compiler do most of the work for you when balancing contradictory requirements (at least points 1 and 2):
- The decompression stub code length must be minimal
- The speed of the decompression stub code must be optimal
- Memory usage for compression and decompression must be kept within sensible limits
You can then tweak the output assembly as you see fit for a good trade-off between the above points.
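To make the PIC point concrete, here is a sketch of what such a stub entry might look like in C. All the RVAs are placeholder values a packer would patch in at pack time, and receiving the image base as a parameter is a simplification (a real stub derives it itself, e.g. via a call/pop delta or by reading the PEB); import fixing, relocations, and page-protection changes are also left out.

```c
/* PIC-style stub entry sketch (assumptions: built freestanding with no
 * CRT, the packer patches the constants below at pack time, and the
 * loader-side trampoline passes in the image base).
 * PIC discipline means: no global variables, no string literals, no
 * import-table calls -- everything lives on the stack or is computed
 * from run-time addresses. */
#include <stddef.h>
#include <stdint.h>

size_t lz77_decompress(const uint8_t *src, size_t src_len,
                       uint8_t *dst, size_t dst_cap);  /* from the sketch above */

typedef void (*entry_t)(void);

void stub_entry(uint8_t *image_base)
{
    /* Placeholder values the packer would patch in. */
    const uint32_t packed_rva  = 0x3000;    /* where compressed data sits */
    const uint32_t packed_size = 0x8000;
    const uint32_t unpack_rva  = 0x1000;    /* original section location  */
    const uint32_t unpack_size = 0x20000;
    const uint32_t oep_rva     = 0x1234;    /* original entry point       */

    lz77_decompress(image_base + packed_rva, packed_size,
                    image_base + unpack_rva, unpack_size);

    /* Jump to the original entry point. A real stub would first restore
     * page protections (VirtualProtect resolved by hand, e.g. by walking
     * the PEB loader list) and rebuild imports and relocations. */
    ((entry_t)(image_base + oep_rva))();
}
```

The PIC discipline shows in what is absent: no globals, no literals, no direct API calls, so the stub keeps working wherever the loader maps it.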
Once you have a good understanding of compression theory, you should definitely seek to implement a PAQ-derived compressor.
There are multiple advantages to following the PAQ lead:
It is known to be the best compressor in multiple fields (text, image, and executable data, although with a different modeling context each time). See various benchmarks here and here.
It is open source (and follows the GPL license).
Try to follow especially the PAQ8PX variant for a start. Injecting a minimal (in length) and fast decompression stub into the resulting compressed PE file will be the most difficult part of the job, though.
The PAQ algorithm is also used in kkrunchy, a well-known PE compressor by the farbrausch demoscene group. A quite good glimpse at its internals is given here.
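For intuition about what a PAQ-derived compressor actually does: the core idea is context mixing, where several models each predict the probability of the next bit and the predictions are combined in the logistic domain with weights trained online. The toy C program below shows only that mixing and weight-update step; the two models and the learning rate are invented for illustration, and real PAQ adds fixed-point arithmetic, many context models, and an arithmetic coder on top.

```c
#include <math.h>
#include <stdio.h>

#define NMODELS 2

/* PAQ-style logistic mixing, reduced to its core: predictions are
 * "stretched" to the logistic domain, combined with a weighted sum,
 * and "squashed" back into a probability. */
static double stretch(double p) { return log(p / (1.0 - p)); }
static double squash(double x)  { return 1.0 / (1.0 + exp(-x)); }

static double weights[NMODELS];  /* mixer weights, start at 0 */

/* Combine model predictions into one probability that the bit is 1. */
static double mix(const double p[NMODELS], double st[NMODELS])
{
    double dot = 0.0;
    for (int i = 0; i < NMODELS; i++) {
        st[i] = stretch(p[i]);
        dot  += weights[i] * st[i];
    }
    return squash(dot);
}

/* After seeing the actual bit, nudge weights toward the models that
 * were right (online gradient descent on coding loss). */
static void update(const double st[NMODELS], double p_mix, int bit,
                   double lr)
{
    double err = bit - p_mix;
    for (int i = 0; i < NMODELS; i++)
        weights[i] += lr * err * st[i];
}

int main(void)
{
    /* Toy stream: model 0 is well informed, model 1 is pure noise;
     * the mixer learns to trust model 0. */
    double p[NMODELS], st[NMODELS];
    for (int t = 0; t < 1000; t++) {
        int bit = (t % 3) != 0;          /* the actual next bit */
        p[0] = bit ? 0.9 : 0.1;          /* informative model   */
        p[1] = 0.5;                      /* uninformative model */
        double pm = mix(p, st);
        update(st, pm, bit, 0.02);
    }
    printf("learned weights: %f %f\n", weights[0], weights[1]);
    return 0;
}
```

Running it, the weight on the informative model grows while the weight on the noise model stays near zero, which is the whole point: the mixer learns which models to trust.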
As a final word, if you are not accustomed to data compression theory, I'd suggest as a first read the very good introduction Data Compression Explained by Matt Mahoney (the author of PAQ) and the wiki book about data compression theory.
Keep in mind that compression is always a trade-off: the best compression ratio is not always what end users want. If they need 256 GB of memory, have to wait 5 minutes, or carry a 10 MB decompression stub just to decompress, that is clearly not the right path...