0 votes

I have a string, and I want to find the best variable-length character groupings (up to a maximum length) to make better Huffman codes:

For example, the string

"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--" 

from which I manually created the following symbol-to-frequency table:

In [1665]: symb2freq
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}

This allowed me to produce the Huffman binary string 110011111000011010110011 to represent:

'++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'

which is half the size, and better than fixed-length symbol groupings. My question is: is there a regex, itertools, or other programming method to create symb2freq tables with symbols of multiple lengths, up to a largest length? In my case the largest length is 11. The shortest length doesn't matter; the goal is to get the best frequencies possible so that the Huffman code representing the tree is as small as possible. If I group by a fixed size, I can't get as short a Huffman binary string as the best code set gives.
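
To double-check the 24-bit figure, here is a minimal sketch, assuming the third-party huffman package from PyPI (pip install huffman); the exact codewords it picks may differ from mine, but the optimal total length comes out the same:

import huffman  # third-party: pip install huffman

# The table I built manually.
symb2freq = {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1,
             '----++-+--': 1, '--+-': 1}
codebook = huffman.codebook(symb2freq.items())

# Total encoded size: codeword length times symbol frequency.
total_bits = sum(len(code) * symb2freq[sym] for sym, code in codebook.items())
print(total_bits)  # 24 == len('110011111000011010110011')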

So my question is: how do I, programmatically, using regex, itertools, or any other method, convert:

"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--"

to:

In [1665]: symb2freq
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}

which I built manually.

I think this is a hard problem, so thanks in advance for any advice on how to tackle it.

I wrote the code below to show that you can use variable-length codes to decode a Huffman code back to the pattern shown, which I use to recreate the number 1009732533765201 in half the size. I created the symb2freq manually and need ideas on how to create it programmatically, so I can achieve better compression than fixed grouping sizes. I have a program that creates a mathematical pattern to recreate a number, and I'm looking to create variable-length codes that best represent those patterns, which are strings of +'s and -'s.

# Incorporates a better method of rebuilding the string, from RoyM.
# Uncompresses the number 1009732533765201,
# which has the binary:
# 0b11100101100101100010101100111111100100010001010001
# Our compressed binary is:
# 110011111000011010110011
# We compress a variable-length pattern that is then decoded, to achieve
# nearly as good compression (maybe better in this case) and, more
# importantly, to compare this approach against binary Huffman compression,
# since it works on XOR patterns instead.
# s maps Huffman codewords to the XOR math required to rebuild the original
# number: we don't compress the number itself, we compress the XOR math
# that rebuilds it.

s = {'11': '++--', '01': '-+-+-', '000': '+-++++++', '001': '--', '100': '--+-', '101': '----++-+--'}
compressed_binary = '110011111000011010110011'

def decodepattern(pattern, j=1):
    # Rebuild the integer by XORing in positive/negative powers of two,
    # one bit position per +/- symbol.
    x = 2
    for c in pattern:
        if c == '+' or c == '1':
            j ^= x
        else:
            j ^= -x
        x <<= 1
    return abs(j)


# Greedy prefix decode: grow the window compressed_binary[i:j] until it
# matches a codeword in s, emit its symbol, then continue from j.
i = 0
j = 0
uncompressed_string = ''
while j < len(compressed_binary):
    if compressed_binary[i:j] in s:
        uncompressed_string += s[compressed_binary[i:j]]
        i = j
    else:
        j += 1
uncompressed_string += s[compressed_binary[i:j]]  # the final codeword
answer = decodepattern(uncompressed_string)

print("Bin length of 1009732533765201 is 50 which is the same as the pattern length")
print("Original Bin of 1009732533765201 is: ")
print(bin(1009732533765201)[2:])
print("Original Pattern:")
print(uncompressed_string)
print("Compressed to: ", bin(13600435)[2:])
print("110011111000011010110011 uncompressed to the original string pattern of : ")
print(uncompressed_string)
print("Which then i was able to recreate the original integer of: ")
print(decodepattern(uncompressed_string))

The output of the program is:

Bin length of 1009732533765201 is 50, which is the same as the pattern length
Original bin of 1009732533765201 is:
11100101100101100010101100111111100100010001010001
Original pattern:
++----++--++--+-++++++-+-+-----++-+---+-+---+-++--
Compressed to: 110011111000011010110011
110011111000011010110011 uncompressed to the original string pattern of:
++----++--++--+-++++++-+-+-----++-+---+-+---+-++--
From which I was able to recreate the original integer of:
1009732533765201

So you can see that I can use variable-length codes to decompress a number; I just don't know how to create the variable-length codes (up to a certain length) without doing it manually.

The expected result is some regex, itertools, or other method to take the string:

"++----++--++--+-++++++-+-+-----++-+---+-+---+-++--"

and create a symb2freq like this (which I did manually):

In [1665]: symb2freq
Out[1665]: {'++--': 4, '-+-+-': 2, '--': 1, '+-++++++': 1, '----++-+--': 1, '--+-': 1}
How do you expect to know if you've obtained the "right" partitioning of the string into symbols, and how do you expect to convey this partitioning to a reader that has to decode the message? Are you sure this is valuable compression? - RoyM
It seems that it's much better than fixed-length codes, since the bin is smaller. I'm not sure how I'll know that I've obtained the right partitioning, but I can say that manually making the codes seemed to maximize the compression of the number. I tried using a longest-to-shortest string match to build the code table (a sketch of that kind of greedy approach appears after these comments), but the resulting binary was bigger than the manual fit. This table seemed to maximize the compression, so I think it's the smallest symbol-table-to-binary fit for compressing this number with Huffman. - oppressionslayer
Yes, it is better. But your selection-process is unclear, and therefore your problem is lacking in its description. - RoyM
If you can group characters into symbols arbitrarily, then the optimal solution would be to have a single symbol (the entire string), resulting in 1 bit of compressed Huffman data. To make any meaningful comparison of grouping strategies, you'd have to specify how you convey the grouping to the receiver. - jasonharper
Why not start with some proven open-source code that already does this and modify it to suit your needs? Wikipedia mentions the PKZIP algorithm, but I'd be surprised if there weren't a lot of other well-known compression algorithms that use Huffman codes. - bitinerant
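
As a programmatic baseline for the greedy grouping idea mentioned in the comments, here is a byte-pair-encoding-style sketch (my naming; this is not the asker's exact longest-match method, and as noted above, greedy grouping can lose to a hand-tuned table): repeatedly merge the most frequent adjacent pair of symbols, capped at the maximum symbol length, then read the frequencies off the final token list.

from collections import Counter

def bpe_symbols(string, max_len):
    toks = list(string)  # start from single characters
    while True:
        # Count adjacent pairs whose merge would not exceed max_len.
        pairs = Counter((a, b) for a, b in zip(toks, toks[1:])
                        if len(a) + len(b) <= max_len)
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # merging a pair that occurs once cannot improve frequencies
        merged, out, i = a + b, [], 0
        while i < len(toks):
            if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(toks[i])
                i += 1
        toks = out
    return Counter(toks)

print(bpe_symbols('++----++--++--+-++++++-+-+-----++-+---+-+---+-++--', 11))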

1 Answer

1 vote

Although I am not certain about the soundness of the question, here is how you can brute-force all possible codebooks:

#!/usr/bin/env python3

import huffman  # third-party package: pip install huffman


def find_best_codebook(string, max_symbol_length, max_partitions):
    # Upper bound on total bits: one bit per input character.
    min_entropy = len(string)
    best_i = None
    best_weights = None
    best_codebook = None
    for i, tokens in enumerate(gen_partitions(string, max_partitions=max_partitions)):
        if any(len(t) > max_symbol_length for t in tokens):
            continue
        symbol_weights = compute_weights(tokens)
        codebook = huffman.codebook(symbol_weights.items())
        entropy = compute_entropy(symbol_weights, codebook)
        print(f'i={i} X={symbol_weights} H(X)={entropy}')

        if entropy < min_entropy:
            best_i = i
            best_weights = symbol_weights
            best_codebook = codebook
            min_entropy = entropy

    print(f'best i={best_i} X={best_weights} H(X)={min_entropy}')
    print('best codebook:', best_codebook)
    return best_codebook, min_entropy


def gen_partitions(string, max_partitions=None):
    # Recursively yield every split of `string` into at most max_partitions + 1
    # pieces; empty pieces can occur and are filtered out by compute_weights().
    if max_partitions is None:
        max_partitions = len(string)

    if max_partitions <= 0:
        yield [string]
        return

    for index in range(len(string) + 1):
        for rest in gen_partitions(string[index:], max_partitions - 1):
            yield [string[:index]] + rest


def compute_weights(tokens):
    # Count occurrences of each non-empty token.
    freqs = {}
    for token in tokens:
        if not token:
            continue
        freqs[token] = freqs.get(token, 0) + 1
    return freqs


def compute_entropy(symbol_weights, codebook):
    # Total encoded length in bits (not Shannon entropy, despite the name).
    return sum(len(code) * symbol_weights[symbol] for symbol, code in codebook.items())


if __name__ == '__main__':
    string = '++----++--++--+-++++++-+-+-----++-+---+-+---+-++--'
    max_symbol_length = 11
    max_partitions = len(string) - max_symbol_length
    find_best_codebook(string, max_symbol_length, max_partitions)

Note that enumerating all partitions of a string of length N is exponential in N: there are N-1 internal split points, each of which is either cut or not, hence 2^(N-1) = O(2^N) distinct partitions.
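
To see that count concretely, here is a duplicate-free sketch that enumerates partitions by choosing a subset of the N-1 internal cut points (unlike gen_partitions above, it never emits empty tokens):

from itertools import combinations

def distinct_partitions(s):
    # Each subset of the len(s)-1 internal boundaries gives one partition.
    n = len(s)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [s[a:b] for a, b in zip(bounds, bounds[1:])]

for n in range(1, 6):
    print(n, sum(1 for _ in distinct_partitions('x' * n)))
# prints 1 1, 2 2, 3 4, 4 8, 5 16: doubling with each extra character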

Also, given the way the question is formed, you should be able to get the best compression by simply partitioning the input into consecutive chunks of maximum codeword length. For example, using a maximum symbol length of 11 as stated in the question, you can do {'++----', '++--++--+-+', '+++++-+-+--', '---++-+---+', '-+---+-++--'} to get 5 symbols (at most 3-bit codewords) and a compressed text of at most 15 bits.