I have about 700 Python source files (.py), each a few kilobytes in size (the average file size is 12 kB, but there are many 1 kB files as well), and I'd like to create a compressed archive containing all of them. My requirements:
- The archive should be small. (.zip files give me a compression ratio of 3.816, I need something smaller than that. A .rar file created with rar -s -m5 a gives me a compression ratio of 6.177, I'd prefer 7 or more.)
- The compression must be lossless, it must preserve the original file bit-by-bit. (So minification is out.)
- There must be a small library written in C which can list the archive and extract individual files.
- The decompression library must be fast, i.e. not much slower than zlib, preferably faster.
- If I want to extract a single file, I don't have to uncompress large, unrelated portions of the archive. (So compressed .tar files are out, and solid .rar files are out.)
- Since all .py files are small (only a few kilobytes in size), I don't need a streaming decompressor or seeking support within a file.
- If possible, decompression should be initialized from a context dictionary generated from the union of the .py files, to save more space.
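To make the last requirement concrete, here is a minimal sketch of per-file compression with a shared preset dictionary, using Python's standard zlib module (which wraps the same deflate dictionary mechanism available in the C zlib library). The helper names build_dictionary, compress_with_dict and decompress_with_dict are hypothetical, and the dictionary construction (simply the tail of the concatenated sources) is a naive assumption, not a tuned method.

```python
import zlib

# Deflate can only reference the last 32 KiB of a preset dictionary.
MAX_DICT_SIZE = 32768


def build_dictionary(sources, max_size=MAX_DICT_SIZE):
    """Naive dictionary (assumption): concatenate the sources, keep the tail."""
    blob = b"".join(sources)
    return blob[-max_size:]


def compress_with_dict(data, zdict):
    """Compress one file independently, primed with the shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob, zdict):
    """Decompress one file; only the dictionary and this blob are needed."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    sample = [
        b"import os\nimport sys\n\n\ndef main():\n    print(os.getcwd())\n",
        b"import os\nimport sys\n\n\ndef main():\n    print(sys.argv)\n",
    ]
    zdict = build_dictionary(sample[:1])
    packed = compress_with_dict(sample[1], zdict)
    print(len(sample[1]), len(zlib.compress(sample[1], 9)), len(packed))
```

Each file stays independently extractable, because decompression needs only the shared dictionary plus that file's own compressed bytes. How much space this saves on real .py files depends on how the dictionary is chosen.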
Which compression algorithm and C decompression library do you recommend?
I know about the concept of code minification (e.g. removing comments and extra whitespace, renaming local variables to single letters), and I'll consider using this technique for some of my .py files, but in this question I'm not interested in it. (See a Python minifier here.)
I know about the concept of bytecode compilation (.pyc files), but in this question I'm not interested in it. (The reason I don't want to have bytecode in the archive is that bytecode is architecture- and version-dependent, so it's less portable. Also .pyc files tend to be a bit larger than minified .py files.)
See my answers containing plan B and plan C. I'm still looking for plan A, which is smaller than ZIP (but it will most probably be larger than .tar.xz), and it has smaller overhead than .tar.xz.
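For anyone who wants to reproduce the size comparison between ZIP and .tar.xz on their own files, a rough sketch using only the Python standard library follows. The function names zip_size, tar_xz_size and ratio are hypothetical helpers, and files is assumed to be a dict mapping archive member names to their byte contents.

```python
import io
import tarfile
import zipfile


def zip_size(files):
    """Size in bytes of a deflate-compressed ZIP holding the given files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


def tar_xz_size(files):
    """Size in bytes of a solid .tar.xz archive holding the given files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return len(buf.getvalue())


def ratio(files, archive_size):
    """Compression ratio: total uncompressed bytes / archive bytes."""
    return sum(len(data) for data in files.values()) / archive_size
```

ZIP compresses every member separately and stores per-member headers, while .tar.xz compresses the whole tar stream as one unit, which is why it tends to be smaller for many small, similar files but cannot extract a single file cheaply.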
tar with xz (LZMA2) compression, because as far as I can tell, it generally has the highest compression ratios out there, but you ruled them out both by rejecting the tar format, and the fact that LZMA2 is a lot slower than zlib. – Delan Azabani