Disclaimer: This problem requires a very good knowledge of the DEFLATE algorithm.
I am hoping I could solicit some ideas identifying the compression algorithm being used in a particular file format. This is a legacy proprietary format that my application needs to support, so we are trying to reverse engineer it. (Going to the original creator is not an option, for reasons I won't get into).
I'm extremely close to cracking it, but I feel like I'm living Xeno's paradox because every day I seem to get halfway closer to the finish line but never there!
Here's what I know so far:
It is definitely using something extremely similar to the DEFLATE algorithm. Similarities -
- The compressed data is represented by canonical Huffman codes (usually starting with 000, but I'm not sure that is always the case).
- The data is preceded (I believe immediately) by a header table which identifies the bit lenghts of each of the actual codes. Like DEFLATE, this table ALSO comprises cannonical Huffman codes (starting either at 0 or 00). These codes provide the bit-lenghts of each character in the 0-255+ alphabet plus whatever distance codes might be used.
- Finally, again like DEFLATE, the header table with the bit lenghts for the main codes is also preceded (I think immediately) by a series of 3-bit codes used to derive the header table codes (I'll call this the "pre-header").
At this point the similarities seem to end though.
The 3-bit codes in the pre-header do not appear go in the 16, 17, 18, 0, 8 ... optimal order as specified by DEFLATE, but rather seem to go sequentially, like 6 7 8 9....
Another difference is that each 3-bit code is not necessarily a literal bit length. For example, here's a header that I've mostly deciphered (I'm 99.99% confident it is correct):
00000001011 100 010 110 010 010 011 010 110 101 100 011 010 010 011 100 010 111
*0* skA *3* *4* *5* *6* *7* *8* *9* skB
Ignoring the unmarked bits, this results in the following code table:
00 7-bits
01 8-bits
100 6-bits
101 9-bits
1100 0-bits (skip code)
1101 skA = skip 3 + value of next 4 bits
1110 5-bits
11110 4-bits
111110 skB = skip 11? + value of next 9 bits
111111 3-bits
The most glaring problem is that there are additional bit-lenghts in the header table that are unused. And, in fact, they would not be usable at all, as there cannot be any additional 2-bit or 3-bit codes, for example, for the codes to be canonical (right?).
The author is also using non-standard codes for 16+. They don't seem to use the copy code (16 in DEFLATE) at all; the main headers all have huge strings of identical length codes (terribly inefficient...), and the skip codes use the next 4 and 9 bits to determine the number of skips, respectively, rather than 3 and 7 as in DEFLATE.
Yet another key difference is in the very first bits of the header. In DEFLATE the first bits are HLIT(5), HDIST(5), and HCLEN(4). If I interpreted the above header that way using LSB packing, I'd get HLIT = 257 (correct), HDIST = 21 (unsure if correct) and HCLEN = 7 (definitely not correct). If I use MSB packing instead, I'd get HLIT=257, HDIST = 6 (more likely correct) and HCLEN = 16 (appears correct). BUT, I don't think there are actually intended to be 14 bits in the prefix because I appear to need the "100" (see above) for the bit count of the 0-bit (skip) code. And in other examples, bits 10-13 don't appear to correlate to the length of the pre-header at all.
Speaking of other examples, not every file appears to follow the same header format. Here's another header:
00000001000 100 100 110 011 010 111 010 111 011 101 010 110 100 011 101 000 100 011
In this second example, I again happen to know that the code table for the header is:
0 8-bits
10 7-bits
110 6-bits
11100 skA
11101 5-bits
111100 0-bits (skip)
111101 skB
111110 9-bits
1111110 3-bits
1111111 4-bits
However, as you can see, many of the required code lenghts are not in the header at all. For example there's no "001" to represent the 8-bit code, and they are not even close to being in sequence (neither consecutively nor by the optimal 16, 17, 18...).
And yet, if we shift the bits left by 1:
skA *0* *5* *6* *7* *8* *9*
0000000100 010 010 011 001 101 011 101 011 101 110 101 011 010 001 110 100 010 001 1
This is much better, but we still can't correctly derive the code for skB (110), or 3 or 4 (111). Shifting by another bit does not improve the situation.
Incidentally, if you're wondering how I am confident that I know the code tables in these two examples, the answer is A LOT of painstaking reverse engineering, i.e., looking at the bits in files that differ slightly or have discernable patterns, and deriving the canonical code table being used. These code tables are 99+% certainly correct.
To summarize, then, we appear to have an extremely close variant of DEFLATE, but for inexplicable reasons one that uses some kind of non-standard pre-header. Where I am getting tripped up, of course, is identifying which pre-header bits correspond to the code bit-lengths for the main header. If I had that, everything would fall into place.
I have a couple of other examples I could post, but rather than ask people to do pattern matching for me, what I'm really praying for is that someone will recognize the algorithm being used and be able to point me to it. I find it unlikely that the author, rather than use an existing standard, would have gone to the trouble of coding his own algorithm from scratch that was 99% like DEFLATE but then change the pre-header structure only slightly. It makes no sense; if they simply wanted to obfuscate the data to prevent what I'm trying to do, there are much easier and more effective ways.
The software dates back to the late 90s, early 2000s, by the way, so consider what was being done back then. This is not "middle out" or anything new and crazy. It's something old and probably obscure. I'm guessing some variant of DEFLATE that was in use in some semi-popular library around that time, but I've not been having much luck finding information on anything that isn't actually DEFLATE.
Many, many thanks for any input.
Peter
PS - As requested, here is the complete data block from the first example in the post. I don't know if it'll be of much use, but here goes. BTW, the first four bytes are the uncompressed output size. The fifth byte begins the pre-header.
B0 01 00 00 01 71 64 9A D6 34 9C 5F C0 A8 B6 D4 D0 76 6E 7A 57 92 80 00 54 51 16 A1 68 AA AA EC B9 8E 22 B6 42 48 48 10 9C 11 FE 10 84 A1 7E 36 74 73 7E D4 90 06 94 73 CA 61 7C C8 E6 4D D8 D9 DA 9D B7 B8 65 35 50 3E 85 B0 46 46 B7 DB 7D 1C 14 3E F4 69 53 A9 56 B5 7B 1F 8E 1B 3C 5C 76 B9 2D F2 F3 7E 79 EE 5D FD 7E CB 64 B7 8A F7 47 4F 57 5F 67 6F 77 7F 87 8F 97 9D FF 4F 5F 62 DA 51 AF E2 EC 60 65 A6 F0 B8 EE 2C 6F 64 7D 39 73 41 EE 21 CF 16 88 F4 C9 FD D5 AF FC 53 89 62 0E 34 79 A1 77 06 3A A6 C4 06 98 9F 36 D3 A0 F1 43 93 2B 4C 9A 73 B5 01 6D 97 07 C0 57 97 D3 19 C9 23 29 C3 A8 E8 1C 4D 3E 0C 24 E5 93 7C D8 5C 39 58 B7 14 9F 02 53 93 9C D8 84 1E B7 5B 3B 47 72 E9 D1 B6 75 0E CD 23 5D F6 4D 65 8B E4 5F 59 53 DF 38 D3 09 C4 EB CF 57 52 61 C4 BA 93 DE 48 F7 34 B7 2D 0B 20 B8 60 60 0C 86 83 63 08 70 3A 31 0C 61 E1 90 3E 12 32 AA 8F A8 26 61 00 57 D4 19 C4 43 40 8C 69 1C 22 C8 E2 1C 62 D0 E4 16 CB 76 50 8B 04 0D F1 44 52 14 C5 41 54 56 15 C5 81 CA 39 91 EC 8B C8 F5 29 EA 70 45 84 48 8D 48 A2 85 8A 5C 9A AE CC FF E8
Edit 7/11/2015
I've managed to decipher quite a bit additional information. The algorithm is definitely using LZ77 and Huffman coding. The length codes and extra bits seem to all match that used in Deflate.
I was able to learn a lot more detail about the pre-header as well. It has the following structure:
HLEN 0 SkS SkL ?? 3 4 5 6 7 8 9 HLIT
00000 00101110 001 0 1100 100 100 110 10 110 101 100 011 010 010 011 100010111
HLEN = the last bit-length in the pre-header - 3 (e.g. 1100 (12) means 9 is the last bit-length code)
HLIT = the number of literal codes in the main dictionary
SkS = "skip short" - skips a # of codes determined by the next 4-bits
SkL = "skip long" - skips a # of codes determined by the next 9-bits
0 - 9 = the number of bits in the dictionary codes for the respective bit lengths
The unmarked bits I'm still unable to decipher. Also, what I'm now seeing is that the pre-header codes themselves appear to have some extra bits thrown in (note the ?? between SkL and 3, above). They're not all straight 3-bit codes.
So, the only essential information that's now missing is:
- How to parse the pre-header for extra bits and whatnot; and
- How many distance codes follow the literal codes
If I had that information, I could actually feed the remaining data to zlib by manually supplying the code length dictionary along with the correct number of literal vs. distance codes. Everything after this header follows DEFLATE to the letter.
Here are some more example headers, with the bit-length codes indicated along with the number of literal and length codes. Note in each one I was able to reverse engineer the the answers, but I remain unable to match the undeciphered bits to those statistics.
Sample 1
(273 literals, 35 length, 308 total)
????? ???????? ??? ? HLEN 0 SkS SkL ?? 3 ? 4 ? 5 6 7 8 9 HLIT
00000 00100010 010 0 1100 110 101 110 10 111 0 111 0 101 011 010 001 110 100010001
Sample 2
(325 literal, 23 length, 348 total)
????? ???????? ??? ? HLEN 0 SkS SkL ?? 3 4 5 6 7 8 9 HLIT
00000 00110110 001 0 1100 101 101 110 10 110 000 101 000 011 010 001 101000101
Sample 3
(317 literal, 23 length, 340 total)
????? ???????? ??? ? HLEN 0 SkS SkL ??? 4 5 ? 6 7 8 9 HLIT
00000 01000100 111 0 1100 000 101 111 011 110 111 0 100 011 010 001 100111101
Sample 4
(279 literals, 18 length, 297 total)
????? ???????? ??? ? HLEN 0 SkS SkL ?? 3 4 5 6 7 8 9 HLIT
00000 00101110 001 0 1100 100 100 110 10 110 101 100 011 010 010 011 100010111
Sample 5
(258 literals, 12 length, 270 total)
????? ???????? ??? ? HLEN 0 SkS SkL ?? 2 3 4 HLIT
00000 00000010 000 0 0111 011 000 011 01 010 000 001 100000010
I'm still hoping someone has seen a non-standard DEFLATE-style header like this before. Or maybe you'll see a pattern I'm failing to see... Many thanks for any further input.