2
votes

I am trying to split my input into those parts that match a certain pattern and the rest, let's say

data Data = A Int | B Char | C String
parseDatas :: Parsec [Token] () a [Data]

I already have written two more-or-less complicated parsers

parseA :: Parsec [Token] () Data
parseB :: Parsec [Token] () Data

that match the things I am looking for. Now the obvious solution is

parseDatas = many (parseA <|> parseB <|> parseC)

where the parser for the intermediate parts would look like this:

makeC :: [Token] -> Data
makeC = C . concatMap show -- or something like this
parseC :: Parsec [Token] () Data
parseC = makeC <$> many anyToken

Meh, that throws a runtime [ERROR] Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string. - ok, easily fixed:

parseC = makeC <$> many1 anyToken

But now parseC consumes the entire input (that starts with something I'm not looking for), ignoring any patterns that should yield an A or B!

If my patterns were regexes1, I would now have changed the + operator to the non-greedy +? operator. How can I do the same for the many1 parser combinator?

1: which I cannot use, as I'm operating on tokens not characters


A solution I found was

parseC = makeC <$> many1 (notFollowedBy (parseA <|> parseB) >> anyToken)

but that does look, uh, suboptimal. It's not really generic. There must be something better.

I also had a look at Parsec how to find "matches" within a string where the suggestion was to define a recursive parser, but that looks like a hazzle if I don't want to drop the intermediate tokens and collect them in a list instead.

3

3 Answers

3
votes

You could let parseC consume exactly one token at a time:

parseDatas = many $ parseA <|> parseB <|> (C . show <$> anyToken)

and then, if you want, group adjacent Cs into one to conserve semantics:

groupCs (C c) (C c':xs) = C (c ++ c') : xs
groupCs x xs = x : xs
parseDatas = foldr groupCs [] <$> many (parseA <|> parseB <|> (C . show <$> anyToken))

If you want to apply some operation make :: [Token] -> String on consecutive Cs:

data Data c = A Int | B Char | C c deriving Functor

groupCs :: [Data a] -> [Data [a]] -> [Data [a]]
groupCs (C c) (C cs:xs) = C (c:cs) : xs
groupCs (C c) xs = C [c] : xs
groupCs x xs = x : xs

parseDatas = (map.fmap) make . foldr groupCs [] <$> many (parseA <|> parseB <|> (C <$> anyToken))
1
votes

The sepCap parser combinator from replace-megaparsec can split a string into those parts that match a certain pattern and the rest.

Try this:

sepCap (parseA <|> parseB)

Or, if parseA and parseB are parsers for different types of things, then maybe use eitherP like:

sepCap (eitherP parseA parseB)
0
votes

Also, for the general question:

How do we repeat a “non-greedy” pattern?

…like in regex where we would repeat pattern p non-greedily by writing p*??

The answer is:

manyTill_ p q where q is the entire rest of the parser.

https://hackage.haskell.org/package/parser-combinators/docs/Control-Monad-Combinators.html#v:manyTill_

Or, for p+?, it would be someTill_ p q where q is the entire rest of the parser.

https://hackage.haskell.org/package/parser-combinators/docs/Control-Monad-Combinators.html#v:someTill_