How to efficiently store Huffman Tree and Encoded binary string into a file?

Question

I can easily convert a character string into a Huffman-Tree then encode into a binary sequence.

How should I save these to be able to actually compress the original data and then recover back?

I searched the web but I only could find guides and answers showing until what I already did. How can I use huffman algorithm further to actually achieve lossless compression?

I am using C# for this project.

EDIT: I've achieved these so far, might need rethinking.

I am attempting to compress a text file. I use Huffman Algorithm but there are some key points I couldn't figure out:

"aaaabbbccdef" when compressed gives this encoding

Key = a, Value = 11
Key = b, Value = 01
Key = c, Value = 101
Key = d, Value = 000
Key = e, Value = 001
Key = f, Value = 100

11111111010101101101000001100 is the encoded version. It normally needs 12*8 bits but we've compressed it to be 29 bits. This example might be a litte unnecessary for a file this small but let me explain what I tried to do.

We have 29 bits here but we need 8*n bits so I fill the encodedString with zeros until it becomes a multiple of eight. Since I can add 1 to 7 zeros it is more than enough to use 1-byte to represent this. This case I've added 3 zeros

11111111010101101101000001100000 Then add as binary how many extra bits I've added to the front and the split into 8-bit pieces

00000011-11111111-01010110-11010000-01100000

Turn these into ASCII characters

ÿVÐ`

Now if I have the encoding table I can look to the first 8bits convert that to integer ignoreBits and by ignoring the last ignoreBits turn it back to the original form.

The problem is I also want to include uncompressed version of encoding table with this file to have a fully functional ZIP/UNZIP prpgram but I am having trouble deciding when my ignoreBits ends, my encodingTable startse/ends, encoded bits start/end.

I thought about using null character but there is no assurance that Values cannot produce a null character. "ddd" in this situation produces 00000000-0.....

Can you post the relevant parts of the code you have so far please? (i.e the part where you have a binary data and now you want to save it) — Michael Schönbauer
Found an answer here if anyone interested can look this link and solve their problem I hate when someone says "nvm I found it" under a stackoverflow post — Kayra Uckilinc

Mark Adler Mark Adler · Accepted Answer · 2021-01-10T16:41:15

Your representation of the code needs to be self-terminating. Then you know the next bit is the start of the Huffman codes. One way is to traverse the tree that resulted from the Huffman code, writing a 0 bit for each branch, or a 1 bit followed by the symbol for leaf. When the traverse is done, you know the next bit must be the codes.

You also need to make your data self terminating. Note that in the example you give, the added three zero bits will be decoded as another 'd'. So you will incorrectly get 'aaaabbbccdefd' as the result. You need to either precede the encoded data with a count of symbols expected, or you need to add a symbol to your encoded set, with frequency 1, that marks the end of the data.

How to efficiently store Huffman Tree and Encoded binary string into a file?

1 Answers