Constructing a tuple of length n from a sequence of parser combinators

Question

Question. Is there a way to define an operator such that a sequence of values separated by that operator yields a flat tuple?

I have struggled to find a concise phrasing of my question, so read on for details and examples...

Description

I'm writing myself a few helper operators for Parsec, starting with the following:

(<@) :: GenParser tok st a -> (a -> b) -> GenParser tok st b
p <@ f = do {
    x <- p ;
    return (f x)
}

(<>) :: GenParser tok st a -> GenParser tok st b -> GenParser tok st (a, b)
p <> q = do {
    x <- p ;
    y <- q ;
    return (x, y)
 }

These work as follows:

parseTest ((many upper) <@ length) "HELLO"
5

parseTest (many upper) <> (many lower) "ABCdef"
("ABC", "def")

Unfortunately, a sequence of parsers separated by <> will result in a nested tuple, e.g.:

parseTest (subject <> verb <> object) "edmund learns haskell"
(("edmund", "learns"), "haskell")

Instead of the relatively more cromulent:

("edmund", "learns", "haskell")

I'm looking for a way to define <> so that

p1 :: GenParser tok st a ; p2 :: GenParser tok st b ; p3 :: GenParser tok st c
p1 <> p2 :: GenParser tok st (a, b)
p1 <> p2 <> p3:: GenParser tok st (a, b, c)
...

I do not think I have ever seen a Haskell program where a tuple type of length n (known at compile time) is constructed like this. And I suspect it may be difficult to define the operator with both types:

GenParser tok st a -> GenParser tok st b) -> GenParser tok st (a, b)
GenParser tok st (a, b) -> GenParser tok st c) -> GenParser tok st (a, b, c)

-- how does one tell, at compile time, the difference between a tuple resulting from <>, and one which is simply the intended return type from any other parser? I can only speculate that additional syntax would be required.

So, I am not at all sure it's a good idea or even possible. I would be curious to know how to do it even if it is not a good idea for my case (and I would love to know how to if it's impossible!).

Followup question (if this crazy scheme is possible). How could one annotate one item in a chain of <>s to be left out of the result tuple?

For example, assuming a postfix annotation <#:

p1 :: GenParser tok st a
p2 :: GenParser tok st b
p1 <> keyword "is" <# <> p2 :: GenParser tok st (a, b)

Background

Circa 2006 I learned about parser combinators at university. We used a library in which the <@, and I believe the <> operator, were present and performed similarly to my attempts. I do not know what this library was; it may have been written by a graduate student for teaching our class. In any case, it does not seem to be either Parsec nor the base parser combinators in `Text.Parser.Combinators.

Bonus question. What is the difference between the base parser combinators in Text.ParserCombinators.ReadP/ReadPrec and the ones in Parsec?

I seem to recall this library was also nondeterministic, with each parser invocation returning the set of possible parses and the remaining unparsed input for each. (A successful, complete, unambiguous parse would result in [(parseresult, "")].)

Final question. If this sounds like something you've heard of, could you let me know what it was (for nostalgia's sake) ?

Well done for thinking right and provoking an excellent answer. — AndrewC

icktoofay icktoofay · Accepted Answer · 2014-06-22T02:18:10

Functor

I’d like to direct your attention to something:

(<@)      :: GenParser tok st a -> (a -> b) -> GenParser tok st b
flip fmap :: (Functor f) => f a -> (a -> b) -> f                b

Do you notice a similarity? If we replace f with GenParser tok st in the type of flip fmap, we get the type of (<@). Furthermore, this is not theoretical: GenParser tok st is an instance of Functor. fmap is also available in operator form, named <$>. So, we can rewrite your code in two ways (original first):

ghci> parseTest ((many upper) <@ length) "HELLO"
5
ghci> parseTest (fmap length (many upper)) "HELLO"
5
ghci> parseTest (length <$> (many upper)) "HELLO"
5

Applicative

Functors are nice, but they aren’t powerful enough to subsume your second example. Fortunately, there is another typeclass that is: Applicative. Now, Applicative does not have a function to form a monadic action yielding a pair out of two monadic actions, but it does provide some helpful building blocks. In particular, it provides <*>:

(<*>) :: f (a -> b) -> f a -> f b

It turns out that we can compose this together with <$> to rewrite your second example:

ghci> parseTest (many upper) <> (many lower) "ABCdef"
("ABC", "def")
ghci> parseTest ((,) <$> many upper <*> many lower) "ABCdef"
("ABC", "def")

If you’re not familiar with the syntax, (,) is the function that creates a pair; it has type a -> b -> (a, b).

But Applicative does not stop there. You were looking for a way to flatten a tuple; but rather than creating nested tuples and flattening them out, you can use Applicative to create a triple directly:

ghci> parseTest ((,,) <$> subject <*> verb <*> object) "edmund learns haskell"
("edmund", "learns", "haskell")

And it just happens that Applicative has another operator to help you with your last request, too: <* and *>, which ignore the result of the second or the result of the first operand, respectively. So if you want to ignore the verb:

ghci> parseTest ((,) <$> subject <* verb <*> object) "edmund learns haskell"
("edmund", "haskell")

Bonus question

If I recall correctly, ReadP allows backtracking at each step; Parsec, on the other hand, does not allow backtracking unless you directly or indirectly annotate it in your parser using the try combinator or some other combinator that uses that combinator. As such, Parsec parsers do not backtrack as much, and can have better worst-case performance.

Appendix: Implementing your `<>`…almost ^{_{(not for the faint of heart)}}

I don’t know of any way to implement your exact <>, or to make it work for all sizes of tuples, but if you’re willing to enable a few Haskell extensions, you can eliminate the need for counting, working up to an arbitrary but fixed size of tuples. First, we’ll want to have a type for a 1-tuple:

newtype One a = One a deriving (Eq, Read, Show)  -- derive more if you want

Now, we want a function with these types:

(<>) :: (Applicative f) => f ()        -> f a -> f (One a)
(<>) :: (Applicative f) => f (One a)   -> f b -> f (a, b)
(<>) :: (Applicative f) => f (a, b)    -> f c -> f (a, b, c)
(<>) :: (Applicative f) => f (a, b, c) -> f d -> f (a, b, c, d)
-- ...

How do we have a function with multiple types in Haskell? Typeclasses, of course! But an ordinary typeclass won’t do it: we need functional dependencies. Also, I can’t think of an appropriate name, so I’ll just call it C. ^{_{(If you can think of a better name, tell me in the comments and I’ll edit.)}}

{-# LANGUAGE FunctionalDependencies #-}
class C a b c | a b -> c, c -> a b where
    (<>) :: (Applicative f) => f a -> f b -> f c

Then, to actually implement our instances, we need FlexibleInstances:

{-# LANGUAGE FlexibleInstances #-}
instance C ()        a (One a)      where (<>) _ = fmap One
instance C (One a)   b (a, b)       where (<>) = liftA2 $ \(One a)   b -> (a, b)
instance C (a, b)    c (a, b, c)    where (<>) = liftA2 $ \(a, b)    c -> (a, b, c)
instance C (a, b, c) d (a, b, c, d) where (<>) = liftA2 $ \(a, b, c) d -> (a, b, c, d)
-- ...

Now you can write your parsers like this:

parseTest (return () <> subject <> verb <> object) "edmund learns haskell"
("edmund", "learns", "haskell")

We did have to write return () <> before it, which is undesirable. You could retain your prior implementation of <> and rename our new implementation to <+>, and then you could write your parsers like this:

parseTest (subject <+> verb <> object) "edmund learns haskell"
("edmund", "learns", "haskell")

(<+> used to join all parsers after the first two) There might be a way to make a single operator do it using OverlappingInstances, but the obvious solution violates the functional dependencies.

The question that needs to be asked

If you’re considering using this latter approach, I advise you not to. First, if you’re going to use tuples, it’s not that difficult to count the number of items you want to parse. Second, you often won’t be using tuples in the first place, and this approach only works if you are trying to get tuples. But what makes more sense than a tuple? Well, an AST node. For example, if you were writing a parser for a programming language, you might have a type like this:

data Expression = ...
                | IfExpression Expression Expression Expression
                | ...
                deriving (...)

Then, to parse it, you can use Applicative:

parseIfExpression = IfExpression
                <$> keyword "if"    *> expression
                <*  keyword "then" <*> expression
                <*  keyword "else" <*> expression

Here, we don’t need to count the number of items; it’s encapsulated in the type of the IfExpression type constructor. Since normally you’re going to be parsing ASTs, tuples are going to be irrelevant, and so such a complex solution seems unjustified, particularly since the alternative when you have to use tuples is so uncomplicated (counting the number of items and inserting the appropriate tuple constructor).