LZW Decompression

Question

I am implementing a LZW algorithm in C++.

The size of the dictionary is a user input, but the minimum is 256, so it should work with binary files. If it reaches the end of the dictionary it goes around to the index 0 and works up overwriting it from there.

For example, if i put in a alice in wonderland script and compress it with a dictionary size 512 i get this dictionary.

But i have a problem with decompression and the output dictionary from decompressing the compressed file looks like this.

And my code for decompressing looks like this

struct dictionary
{
    vector<unsigned char> entry;
    vector<bool> bits;
};

void decompress(dictionary dict[], vector<bool> file, int dictionarySize, int numberOfBits)
{
    //in this example
    //dictionarySize = 512, tells the max size of the dictionary, and goes back to 0 if it reaches 513
    //numberOfBits = log2(512) = 9
    //dictionary dict[] contains bits and strings (strings can be empty)
    // dict[0] = 
    //            entry = (unsigned char)0
    //            bits = (if numberOfBits = 9) 000000001
    // dict[255] = 
    //            entry = (unsigned char)255
    //            bits = (if numberOfBits = 9) 011111111
    // so the next entry will be dict[next] (next is currently 256)
    // dict[256] = 
    //            entry = what gets added in the code below
    //            bits = 100000000
    // all the bits are already set previously (dictionary size is int dictionarySize) so in this case all the bits from 0 to 511 are already set, entries are set from 0 to 255, so extended ASCII


    vector<bool> currentCode;
    vector<unsigned char> currentString;
    vector<unsigned char> temp;

    int next=256;
    bool found=false;

    for(int i=0;i<file.size();i+=numberOfBits)
    {
        for(int j=0;j<numberOfBits;j++)
        {
            currentCode.push_back(file[i+j]);
        }

        for(int j=0;j<dictionarySize;j++)
        {
            // when the currentCode (size numberOfBits) gets found in the dictionary
            if(currentCode==dict[j].bits)
            {
                currentString = dict[j].entry;

                // if the current string isnt empty, then it means it found the characted in the dictionary
                if(!currentString.empty())
                {
                    found = true;
                }
            }
        }

        //if the currentCode in the dictionary has a string value attached to it
        if(found)
        {
            for(int j=0;j<currentString.size();j++)
            {
                cout<<currentString[j];
            }

            temp.push_back(currentString[0]);

            // so it doesnt just push 1 character into the dictionary
            // example, if first read character is 'r', it is already in the dictionary so it doesnt get added 
            if(temp.size()>1)
            {
                // if next is more than 511, writing to that index would cause an error, so it resets back to 0 and goes back up
                if(next>dictionarySize-1) //next > 512-1
                {
                    next = 0;
                }
                dict[next].entry.clear();
                dict[next].entry = temp;
                next++;
            }

            //temp = currentString;
        }
        else
        {
            currentString = temp;
            currentString.push_back(temp[0]);

            for(int j=0;j<currentString.size();j++)
            {
                cout<<currentString[j];
            }

            // if next is more than 511, writing to that index would cause an error, so it resets back to 0 and goes back up
            if(next>dictionarySize-1)
            {
                next = 0;
            }
            dict[next].entry.clear();
            dict[next].entry = currentString;
            next++;

            //break;
        }

        temp = currentString;

        // currentCode gets cleared, and written into in the next iteration
        currentCode.clear();

        //cout<<endl;
        found = false;
    }
}

Im am currently stuck and dont know what to fix here to fix the output. I have also noticed, that if i put a dictionary big enough, so it doesnt go around the dictionary (it doesnt reach the end and begin again at 0) it works.

sorry, i fixed it and now its as i have it, i messed up when translating my code to english. It how its supposed to be now — qlabfgerkaSmurf
The first thing would be to properly document and comment your code: a) what are "all external symbols" good for? (void decompress(dictionary dict[], vector<bool> file, int dictionarySize, int numberOfBits), here) b) what do the non-trivial conditions mean? In the particular code above, what shall happen to the contents of temp if temp.size()<=1, and why? Then, there are interpretations where the size of a dictionary entry stays constant: 1) (code for prefix, suffix character) 2) (offset in plain text, length) - what you use now can grow as fast as the text. — greybeard
too lazy to go through your code and this is also not my cup of tea so it might be just a silly question/hint but: Do you use Clear codes? I did GIF encoder/decoder in the past which uses LZW and it uses clear codes which tells your decompresor the dictionary should be cleared. Otherwise the compression and decompression could get out of sync corrupting the decompressed data.... unless you add to dictionary at consistent predictable rate — Spektre
I added answer with mine simple LZW encoder/decoder (following the Wiki page linked) it looks like its working correctly so just cross-check your code with mine (and or the sub result while stepping/traceing your code in debugger) — Spektre
I updated the code a bit to handle unaligned remainder bits at the end and configured it for 8bit data ... — Spektre

Spektre Spektre · Accepted Answer · 2020-05-30T08:54:59

start small

you are using files that is too much data to debug. Start small with strings. I took this nice example from Wikli:

Input: "abacdacacadaad"

step    input           match   output  new_entry   new_index
                                        a           0
                                        b           1
                                        c           2
                                        d           3
1       abacdacacadaad  a       0       ab          4
2       bacdacacadaad   b       1       ba          5
3       acdacacadaad    a       0       ac          6
4       cdacacadaad     c       2       cd          7
5       dacacadaad      d       3       da          8
6       acacadaad       ac      6       aca         9
7       acadaad         aca     9       acad        10
8       daad            da      8       daa         11
9       ad              a       0       ad          12
10      d               d       3       

Output: "0102369803"

So you can debug your code step by step with cross matching both input/output and dictionary contents. Once that is done correctly then you can do the same for decoding:

Input: "0102369803"

step    input   output  new_entry   new_index
                        a           0
                        b           1
                        c           2
                        d           3
1       0       a       
2       1       b       ab          4
3       0       a       ba          5
4       2       c       ac          6
5       3       d       cd          7
6       6       ac      da          8
7       9       aca     aca         9
8       8       da      acad        10
9       0       a       daa         11
10      3       d       ad          12

Output: "abacdacacadaad"

Only then move to files and clear dictionary handling.

bitstream

once you succesfully done the LZW on small alphabet you can try to use the full alphabet and bit encoding. You know the LZW stream can be encoded at any bitlength (not just 8/16/32/64 bits) which can greatly affect compression ratios (in respect to used data properties). So I would try to do univeral access to data at variable (or predefined bitlength).

Was a bit curious so I encoded a simple C++/VCL example for the compression:

//---------------------------------------------------------------------------
// LZW
const int LZW_bits=12;              // encoded bitstream size
const int LZW_size=1<<LZW_bits;     // dictinary size
// bitstream R/W
DWORD bitstream_tmp=0;
//---------------------------------------------------------------------------
// return LZW_bits from dat[adr,bit] and increment position (adr,bit)
DWORD bitstream_read(BYTE *dat,int siz,int &adr,int &bit,int bits)
    {
    DWORD a=0,m=(1<<bits)-1;
    // save tmp if enough bits
    if (bit>=bits){ a=(bitstream_tmp>>(bit-bits))&m; bit-=bits; return a; }
    for (;;)
        {
        // insert byte
        bitstream_tmp<<=8;
        bitstream_tmp&=0xFFFFFF00;
        bitstream_tmp|=dat[adr]&255;
        adr++; bit+=8;
        // save tmp if enough bits
        if (bit>=bits){ a=(bitstream_tmp>>(bit-bits))&m; bit-=bits; return a; }
        // end of data
        if (adr>=siz) return 0;
        }
    }
//---------------------------------------------------------------------------
// write LZW_bits from a to dat[adr,bit] and increment position (adr,bit)
// return true if buffer is full
bool bitstream_write(BYTE *dat,int siz,int &adr,int &bit,int bits,DWORD a)
    {
    a<<=32-bits;        // align to MSB
    // save tmp if aligned
    if ((adr<siz)&&(bit==32)){ dat[adr]=(bitstream_tmp>>24)&255; adr++; bit-=8; }
    if ((adr<siz)&&(bit==24)){ dat[adr]=(bitstream_tmp>>16)&255; adr++; bit-=8; }
    if ((adr<siz)&&(bit==16)){ dat[adr]=(bitstream_tmp>> 8)&255; adr++; bit-=8; }
    if ((adr<siz)&&(bit== 8)){ dat[adr]=(bitstream_tmp    )&255; adr++; bit-=8; }
    // process all bits of a
    for (;bits;bits--)
        {
        // insert bit
        bitstream_tmp<<=1;
        bitstream_tmp&=0xFFFFFFFE;
        bitstream_tmp|=(a>>31)&1;
        a<<=1; bit++;
        // save tmp if aligned
        if ((adr<siz)&&(bit==32)){ dat[adr]=(bitstream_tmp>>24)&255; adr++; bit-=8; }
        if ((adr<siz)&&(bit==24)){ dat[adr]=(bitstream_tmp>>16)&255; adr++; bit-=8; }
        if ((adr<siz)&&(bit==16)){ dat[adr]=(bitstream_tmp>> 8)&255; adr++; bit-=8; }
        if ((adr<siz)&&(bit== 8)){ dat[adr]=(bitstream_tmp    )&255; adr++; bit-=8; }
        }
    return (adr>=siz);
    }
//---------------------------------------------------------------------------
bool str_compare(char *s0,int l0,char *s1,int l1)
    {
    if (l1<l0) return false;
    for (;l0;l0--,s0++,s1++)
     if (*s0!=*s1) return false;
    return true;
    }
//---------------------------------------------------------------------------
AnsiString LZW_encode(AnsiString raw)
    {
    AnsiString lzw="";
    int i,j,k,l;
    int adr,bit;
    DWORD a;
    const int siz=32;                   // bitstream buffer
    BYTE buf[siz];
    AnsiString dict[LZW_size];          // dictionary
    int dicts=0;                        // actual size of dictionary

    // init dictionary
    for (dicts=0;dicts<256;dicts++) dict[dicts]=char(dicts);    // full 8bit binary alphabet
//  for (dicts=0;dicts<4;dicts++) dict[dicts]=char('a'+dicts);  // test alphabet "a,b,c,d"

    l=raw.Length();
    adr=0; bit=0;
    for (i=0;i<l;)
        {
        i&=i;
        // find match in dictionary
        for (j=dicts-1;j>=0;j--)
         if (str_compare(dict[j].c_str(),dict[j].Length(),raw.c_str()+i,l-i))
            {
            i+=dict[j].Length();
            if (i<l)    // add new entry in dictionary (if not end of input)
                {
                // clear dictionary if full
                if (dicts>=LZW_size) dicts=256; // full 8bit binary alphabet
//              if (dicts>=LZW_size) dicts=4;   // test alphabet "a,b,c,d"
                else{
                    dict[dicts]=dict[j]+AnsiString(raw[i+1]); // AnsiString index starts from 1 hence the +1
                    dicts++;
                    }
                }
            a=j; j=-1; break;       // full binary output
//          a='0'+j; j=-1; break;   // test ASCII output
            }
        // store result to bitstream
        if (bitstream_write(buf,siz,adr,bit,LZW_bits,a))
            {
            // append buf to lzw
            k=lzw.Length();
            lzw.SetLength(k+adr);
            for (j=0;j<adr;j++) lzw[j+k+1]=buf[j];
            // reset buf
            adr=0;
            }
        }
    if (bit)
        {
        // store the remainding bits with zeropad
        bitstream_write(buf,siz,adr,bit,LZW_bits-bit,0);
        }
    if (adr)
        {
        // append buf to lzw
        k=lzw.Length();
        lzw.SetLength(k+adr);
        for (j=0;j<adr;j++) lzw[j+k+1]=buf[j];
        }
    return lzw;
    }
//---------------------------------------------------------------------------
AnsiString LZW_decode(AnsiString lzw)
    {
    AnsiString raw="";
    int adr,bit,siz,ix;
    DWORD a;
    AnsiString dict[LZW_size];          // dictionary
    int dicts=0;                        // actual size of dictionary

    // init dictionary
    for (dicts=0;dicts<256;dicts++) dict[dicts]=char(dicts);    // full 8bit binary alphabet
//  for (dicts=0;dicts<4;dicts++) dict[dicts]=char('a'+dicts);  // test alphabet "a,b,c,d"

    siz=lzw.Length();
    adr=0; bit=0; ix=-1;
    for (adr=0;(adr<siz)||(bit>=LZW_bits);)
        {
        a=bitstream_read(lzw.c_str(),siz,adr,bit,LZW_bits);
//      a-='0';                         // test ASCII input
        // clear dictionary if full
        if (dicts>=LZW_size){ dicts=4; ix=-1; }
        // new dictionary entry
        if (ix>=0)
            {
            if (a>=dicts){ dict[dicts]=dict[ix]+AnsiString(dict[ix][1]); dicts++; }
            else         { dict[dicts]=dict[ix]+AnsiString(dict[a ][1]); dicts++; }
            } ix=a;
        // update decoded output
        raw+=dict[a];
        }
    return raw;
    }
//---------------------------------------------------------------------------

and output using // test ASCII input lines:

txt="abacdacacadaad"
enc="0102369803"
dec="abacdacacadaad"

where AnsiString is the only VCL stuff I used and its just self allocating string variable beware its indexes starts at 1.

AnsiString s;
s[5]              // character access (1 is first character) 
s.Length()        // returns size
s.c_str()         // returns char*
s.SetLength(size) // resize

So just use any string you got ...

In case you do not have BYTE,DWORD use unsigned char and unsigned int instead ...

Looks like its working for long texts too (bigger than dictionary and or bitstream buffer sizes). However beware that the clearing might be done in few different places of code but must be synchronized in both encoder/decoder otherwise after clearing the data would corrupt.

The example can use either just "a,b,c,d" alphabet or full 8it one. Currently is set for 8bit. If you want to change it just un-rem the // test ASCII input lines and rem out the // full 8bit binary alphabet lines in the code.

To test crossing buffers and boundary you can play with:

const int LZW_bits=12;              // encoded bitstream size
const int LZW_size=1<<LZW_bits;     // dictinary size

and also with:

const int siz=32;                   // bitstream buffer

constants... The also affect performance so tweak to your liking. Beware the bitstream_write is not optimized and can be speed up considerably ...

Also in order to debug 4bit aligned coding I am using hex print of encoded data (hex string is twice as long as its ASCII version) like this (ignore the VCL stuff):

AnsiString txt="abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa",enc,dec,hex;
enc=LZW_encode(txt);
dec=LZW_decode(enc);

// convert to hex
hex=""; for (int i=1,l=enc.Length();i<=l;i++) hex+=AnsiString().sprintf("%02X",enc[i]);

mm_log->Lines->Add("\""+txt+"\"");
mm_log->Lines->Add("\""+hex+"\"");
mm_log->Lines->Add("\""+dec+"\"");
mm_log->Lines->Add(AnsiString().sprintf("ratio: %i%",(100*enc.Length()/dec.Length())));

and result:

"abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa"
"06106206106306410210510406106410FFFFFF910A10706110FFFFFFD10E06206311110910FFFFFFE11410FFFFFFD0"
"abacdacacadaadddddddaaaaaaaabcccddaaaaaaaaa"
ratio: 81%

LZW Decompression

1 Answers