How to write a parser for a mutually recursive ADT without recursion and side effects?

Question

Warning: long and complicated question incoming.

Some believe total functional programming is a valuable idea, and so is finding techniques for doing it. Minding that, how can I write a parser for a mutually recursive ADT without recursion and side effects? Here, I'm defining as "recursive" any term that isn't strongly normalizing.

What I have tried:

Mind this following mutually recursive ADT:

data Tree = Node Int [Tree]
tree = Node 10 [Node 20 [], Node 30 [], Node 40 []]

The value, tree, could be serialized as:

tree_serial = [0,10,0,0,20,1,0,0,30,1,0,0,40,1,1] :: [Int]

Using ints for simplicity, here, 0 denotes the beginning of a Node or a Cons cell (depending on the state of the parser), 1 denotes Nil and the remaining denotes data. We could easily write a parser for it using side effects:

var string = [0,10,0,0,20,1,0,0,30,1,0,0,40,1,1];

function parse(string){
    function getInt(){
        return string.shift();
    };
    function parseTree(){
        var chr = getInt();
        if (chr === 0)
            return ["Node",getInt(),parseList()];
    };
    function parseList(){
        var chr = getInt();
        if (chr === 0)
            return ["Cons",parseTree(),parseList()];
        if (chr === 1)
            return "Nil";
    };
    return parseTree();
};

console.log(JSON.stringify(parse(string)));

Here, getInt is side-effective: it gets the next int from the string. We could easily and elegantly translate this to Haskell using Parsec or similar - but for better understanding, I skipped those and defined a stripped down parser type:

data Parser res = GetInt (Int -> Parser res) | Return res
runParser (GetInt fn) (c:cs) = runParser (fn c) cs
runParser (Return res) c     = res

This works similar to a monadic parser, except more explicitly:

main = do
    let parsePair = (GetInt (\a -> (GetInt (\b -> Return (a,b)))))
    print $ runParser parsePair [1,2,3,4,5]

Using this, we could define our parser without side effects:

data Tree = Node Int [Tree] deriving Show

parseTree = treeParser Return where
    treeParser = (\ cont -> 
        GetInt (\ _ ->  
            GetInt (\ tag -> 
                listParser (\ listParsingResult -> 
                    (cont (Node tag listParsingResult)))))) 
    listParser = (\ cont -> 
        GetInt (\ a -> 
            if a == 0 
                then treeParser (\x -> listParser (\y -> cont (x : y)))
                else cont []))

main = do
    let treeData = [0,10,0,0,20,1,0,0,30,1,0,0,40,1,1]
    print $ runParser parseTree treeData

This outputs Node 10 [Node 20 [],Node 30 [],Node 40 []], as expected. Notice this still uses recursion, and I had to use cont to pass the control between the two recursive functions. Now, there are 2 strategies to get rid of recursion I'm aware of:

1. Use folds.

2. Use church numbers for bounded recursion.

Using folds is obviously not viable here, since there is no structure to fold on (we are building it!). If we were parsing a list instead of a Tree, using church numbers would be perfect since they work exactly like the Y-combinator for bounded recursion - and, knowing the length of the list, we could just write toChurch listLength listParser init. The problem with this case, though, is that there is mutual recursion going on, and it is not obvious which church number to use. We have many layers of lists and trees of unpredictable lengths. Indeed, if we use a big enough church number, it works without recursion, but at the cost of added work. This is one of the last examples of an actually useful program I couldn't replicate "correctly" without recursion. Can it be done?

For completeness, here is a JavaScript program which parses that tree without recursion, but using made-up church numbers:

function runParser(f){return function(str){
    var a = f(str[0]);
    return a(str.slice(1));
}};
function Const(a){return function(b){return a}};
function toChurch(n){return (function(f){return (function(a){ 
    for (var i=0; i<n; ++i) 
        a  =  f(a); 
    return a; 
}) }) };
function parser(get){
    return toChurch(50)(function(rec){
        return function (res){
            return get(function(a){
                return [get(function(b){
                    return toChurch(50)(function(recl){
                        return function(res){
                            return get(function(a){
                                return [
                                    rec(function(a){
                                        return recl(function(b){
                                            return res(["Cons",a,b])
                                        })
                                    }),
                                    res("Nil")][a];
                            });
                        };
                    })(0)(function(x){return res(["Node",b,x])});
                })][a];
            });
        };
    })(0)(Const);
};
var string = [0,200,0,0,300,0,0,400,1,0,0,500,1,0,0,500,1,1,0,0,600,0,0,700,1,0,0,800,1,0,0,900,1,1,1];
console.log(JSON.stringify(parser(runParser)(string)));

Notice that 50 constant inside the parser function: it is completely arbitrary as a bound. I'm not sure there are "right" choices for those which would "fit perfectly" a specific parseable value.

Do you want to do this without recursion or without explicit recursion? If you use a fold then it's still using recursion, and technically if you're using a list then you're using a recursive type (in Haskell). You could use a vector type if you know what size to make it to start with, but trying to un-recurse recursive data structures isn't a very valuable exercise IMO. — bheklilr
@bheklilr, I updated my question to make my definition of "recursive function" unambiguous. Thank you! — MaiaVictor
"Total" doesn't mean, that you can't have recursion or even infinite chains of something. It's ok to parse an infinite string in a total function — you just need corecursion for that. All total functions must be either structurally recursive (you can find some techniques here) or productive (see e.g. this). There is a paper about total parser combinators. — user3237465
Total parser combinators... for God's sake, I'm gonna quit IT and open a food truck. Will be easier. — paulotorrens
I mean the normal form in the sense that there are no more redexes on the term, including inside abstractions. That is, the term is "strongly normalizing", as defined on this article. I find that is the only sensible definition of totality, and is actually included on its definition on the respective Wikipedia article. There are terms that are provably terminating, but not strongly normalizing - after reading a bit, I guess those parsers fit that category. :( — MaiaVictor

Daniel Wagner Daniel Wagner · Accepted Answer · 2015-07-21T06:08:00

tl;dr: Church-encode your input list, and use that to drive your recursion.

A proper Church-encoding of lists requires RankNTypes, and looks a bit like this:

{-# LANGUAGE RankNTypes #-}

data List a = List { runList :: forall r. (a -> r -> r) -> r -> r }
instance Show a => Show (List a) where
    showsPrec n (List xs) = showsPrec n (xs (:) [])

nilVal :: List a
nilVal = List $ \cons nil -> nil

consVal :: a -> List a -> List a
consVal a (List as) = List $ \cons nil -> cons a (as cons nil)

-- handy for pattern-matching
uncons :: List a -> Maybe (a, List a)
uncons (List xs) = xs cons nil where
    cons x Nothing = Just (x, nilVal)
    cons x (Just (x', xs)) = Just (x, consVal x' xs)
    nil = Nothing

Now we just have to write our parser. I'm really bad with parsing theory, so I threw something awful together. Probably somebody who knows a thing or two about that domain could give you some more principled advice here. I'm going to parse the grammar:

tree -> 0 N list
list -> 0 tree list | 1

My parser state will keep track of what "hole" we're currently parsing. For non-terminals, we'll actually need a stack of the holes. Thus, the terminal holes have one of these forms:

* N list
0 * list
* tree list
*

We'll collapse the last two. Notice that none of these holes have interesting information before them, so we don't need to store anything in a THole. The non-terminal holes have one of these forms:

0 N *
0 * list
0 tree *

In this case, the hole in the tree formation rule has a number before it that we will need later, and the second kind of hole in the list formation rule has a tree before it that we'll need to keep, so NTHole will need those in the constructors. Thus:

data Tree = Node Int [Tree]
    deriving (Eq, Ord, Read, Show)

data THole
    = TreeT0
    | TreeT1
    | ListT
    deriving (Eq, Ord, Read, Show)

data NTHole
    = TreeNT Int
    | ListNT0
    | ListNT1 Tree
    deriving (Eq, Ord, Read, Show)

Our current parser state will be the terminal hole we're currently in, and the stack of non-terminal holes we need to fill as rules get reduced.

type RawState = (THole, List NTHole)
initRawState = (TreeT0, nilVal)

...well, except that we have two more states of interest: done with the list, and error.

type State = Maybe (Either RawState Tree)
initState = Just (Left initRawState)

Now we can write a step function that takes a good state and handles it. Again, you would want a parser generator tool to create one of these for you, but this language is small enough that I did it by hand.

stepRaw :: Int -> RawState -> State
stepRaw 0 (TreeT0, xs) = Just (Left (TreeT1, xs))
stepRaw n (TreeT1, xs) = Just (Left (ListT , consVal (TreeNT n) xs))
stepRaw 0 (ListT , xs) = Just (Left (TreeT0, consVal ListNT0    xs))
stepRaw 1 (ListT , xs) = fst (runList xs cons nil) [] where
    cons v (f, xs) = flip (,) (consVal v xs) $ case v of
        ListNT1 t -> \acc -> f (t:acc)
        TreeNT  n -> \acc -> let t = Node n acc in case uncons xs of
            Nothing -> Just (Right t)
            Just (ListNT0, xs) -> Just (Left (ListT, consVal (ListNT1 t) xs))
            _ -> Nothing
        _ -> \acc -> Nothing
    nil = (\acc -> Nothing, nilVal)
stepRaw _ _ = Nothing

step :: Int -> State -> State
step n v = v >>= either (stepRaw n) (const Nothing)

It turns out that this parser actually runs backwards, which is unfortunate but not a fundamental limitation. It was just easier for me to think in this direction. As required, there is no recursion here. We can try it out on your sample List Int in ghci.

*Main> let x = foldr consVal nilVal [1,1,40,0,0,1,30,0,0,1,20,0,0,10,0]
*Main> runList x step initState
Just (Right (Node 10 [Node 20 [],Node 30 [],Node 40 []]))

I use foldr to build x, and foldr is recursive, so you might scream about that. But we could easily define x without foldr; the built-in list syntax is just more convenient to read and write than long chains of consVal and nilVal.

How to write a parser for a mutually recursive ADT without recursion and side effects?

1 Answers