Reconstructing Huffman tree from (preorder) bitstring in Haskell

Question

I have the following Haskell polymorphic data type:

data Tree a = Leaf Int a | Node Int (Tree a) (Tree a)

The tree will be compressed in a bitstring of 0s and 1s. A '0' signifies a Node and it is followed by the encoding of the left subtree, then the encoding of the right subtree. A '1' signifies a Leaf and is followed by 7 bits of information (for example it might be a char). Each node/leaf is supposed to also contain the frequency of the information stored, but this is not important for this problem (so we can put anything there).

For example, starting from this encoded tree

[0,0,0,1,1,1,0,1,0,1,1,1,1,1,1,0,1,0,0,0,0,1,1,1,1,0,0,0,1,1,1,
 1,0,0,1,1,1,1,1,1,1,0,0,1,0,0,1,1,1,1,0,1,1,1,1,1,1,0,0,0,0,1]

it is supposed to give back something like this

Node 0 (Node 0 (Node 0 (Leaf 0 'k') (Leaf 0 't')) 
       (Node 0 (Node 0 (Leaf 0 'q') (Leaf 0 'g')) (Leaf 0 'r'))) 
(Node 0 (Leaf 0 'w') (Leaf 0 'a'))

(spacing is not important, but it did not fit on one line).

I have little experience working with trees, especially when implementing code. I have a vague idea about how I'd solve this on paper (using something similar to a stack to deal with the depth/levels) but I am still a bit lost.

Any help or ideas are appreciated!

Benjamin Hodgson Benjamin Hodgson · Accepted Answer · 2016-12-24T01:53:53

Well, you're trying to parse a tree of bytes from a bit-stream. Parsing's one of those cases where it pays to set up some structure: we're going to write a miniature parser combinator library in the style of How to Replace Failure by a List of Successes, which will allow us to write our code in an idiomatic functional style and delegate a lot of the work to the machine.

Translating the old rhyme into the language of monad transformers, and reading "string" as "bit-string", we have

newtype Parser a = Parser (StateT [Bool] [] a)
    deriving (Functor, Applicative, Monad, Alternative)

runParser :: Parser a -> [Bool] -> [(a, [Bool])]
runParser (Parser m) = runStateT m

A parser is a monadic computation which operates statefully on a stream of Booleans, yielding a collection of successfully-parsed as. GHC's GeneralizedNewtypeDeriving superpowers allow me to elide the boilerplate instances of Monad et al.

The goal, then, is to write a Parser (Tree SevenBits) - a parser which returns a tree of septuples of Booleans. (You can turn the 7 bits into a Word8 at your leisure by deriving a Functor instance for Tree and using fmap.) I'm going to use the following definition of Tree because it's simpler - I'm sure you can figure out how to adapt this code to your own ends.

data Tree a = Leaf a | Node (Tree a) (Tree a) deriving Show

type SevenBits = (Bool, Bool, Bool, Bool, Bool, Bool, Bool)

Here's a parser that attempts to consume a single bit from the input stream, failing if it's empty:

one :: Parser Bool
one = Parser $ do
    stream <- get
    case stream of
        [] -> empty
        (x:xs) -> put xs *> return x

Here's one which attempts to consume a particular bit from the input stream, failing if it doesn't match:

bit :: Bool -> Parser ()
bit b = do
    i <- one
    guard (i == b)

Here I'm pulling a sequence of seven Booleans from the input stream using replicateM and packing them into a tuple. We'll be using this to populate Leaf nodes' contents.

sevenBits :: Parser SevenBits
sevenBits = pack7 <$> replicateM 7 one
    where pack7 [a,b,c,d,e,f,g] = (a, b, c, d, e, f, g)

Now we can finally write the code which parses the tree structure itself. We'll be choosing between the Node and Leaf alternatives using <|>.

tree :: Parser (Tree SevenBits)
tree = node <|> leaf
    where node = bit False *> liftA2 Node tree tree
          leaf = bit True *> fmap Leaf sevenBits

If node succeeds in parsing a low bit from the head of the stream, it continues to recursively parse the encoding of the left subtree followed by the right subtree, sequencing the applicative actions with liftA2. The trick is that node fails if it doesn't encounter a low bit at the head of the input stream, which tells <|> to give up on node and try leaf instead.

Note how the structure of tree reflects the structure of the Tree type itself. This is applicative parsing at work. We could alternately have structured this parser monadically, first using one to parse an arbitrary bit and then using a case analysis on the bit to determine whether we should continue to parse a pair of trees or a leaf. In my opinion this version is simpler, more declarative, and less verbose.

Also compare the clarity of this code to the low-level style of @behzad.nouri's foldr-based solution. Rather than building an explicit finite-state machine which switches between parsing nodes and leaves - an imperative-flavoured idea - my design allows you to declaratively describe the grammar to the machine using standard functions like liftA2 and <|> and trust that the abstractions will do the right thing.

Anyway, here I'm parsing a simple tree consisting of a pair of Leafs containing the (binary-encoded) numbers 0 and 1. As you can see, it returns the single successful parse and an empty stream of remaining bits.

ghci> runParser tree $ map (>0) [0, 1, 0,0,0,0,0,0,0, 1, 0,0,0,0,0,0,1]
[(Node (Leaf (False, False, False, False, False, False, False)) (Leaf (False, False, False, False, False, False, True)),[])]

Reconstructing Huffman tree from (preorder) bitstring in Haskell

3 Answers