0
votes

I have a line-based text format I want to parse with Parsec†. A line either starts with a pound sign and specifies a key-value pair separated by a colon, or it is a URL that is described by the preceding tags.

Here's a short example:

#foo:bar
#faz:baz
https://example.com
#foo:beep
https://example.net

For simplicity's sake, I'm going to store everything as String. A tag is a pair of strings, type Tag = (String, String), for example ("foo", "bar"). Ultimately, I'd like to group these as ([Tag], URL).

However, I'm struggling to figure out how to parse either [one or more tags] or [one URL].

My current approach looks like this:

import qualified System.Environment   as Env
import qualified Text.Megaparsec      as M
import qualified Text.Megaparsec.Text as M

type Tag = (String, String)

data Segment = Tags [Tag] | URL String
  deriving (Eq, Show)

tagP :: M.Parser Tag
tagP = M.char '#' *> ((,) <$> M.someTill M.printChar (M.char ':') <*> M.someTill M.printChar M.eol) M.<?> "Tag starting with #"

urlP :: M.Parser String
urlP = M.someTill M.printChar M.eol M.<?> "Some URL"

parser :: M.Parser Segment
parser = (Tags <$> M.many tagP) M.<|> (URL <$> urlP)

main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (parser <* M.eof) fname
  print res

If I try to run this on the above sample, I get a parsing error like this:

3:1:
unexpected 'h'
expecting Tag starting with # or end of input

Clearly my use of many in combination with <|> is incorrect. Since the tag parser never consumes any input that belongs to a URL line, this shouldn't be a backtracking problem. How do I need to change this to get the desired result?

The full example is available on GitHub.


† I'm actually using Megaparsec here for better error messages, but I think the problem is quite generic and not about any particular implementation of parser combinators.

2 Answers

1
votes

What you're doing mostly works. The catch is that, at the moment, you only parse a single segment (i.e., either only tags or only a URL), and that doesn't consume the whole input. It's eof that's causing the error.

Simply use one more many or some to allow for multiple segments:

main :: IO ()
main = do
  fname <- head <$> Env.getArgs
  res <- M.parseFromFile (M.many parser <* M.eof) fname
  print res
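
One caveat when repeating the Segment parser: M.many tagP also succeeds when there are no tags at all, so inside an outer many the Tags branch could succeed without consuming any input. A minimal sketch that sidesteps this by using some for the run of tags is shown below. It assumes a recent Megaparsec (7 or later) with String input rather than the Text.Megaparsec.Text module from the question, and the names Parser, segmentsP and parseSegments are made up for this illustration.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, eof, many, parse, some, someTill, (<|>), (<?>))
import Text.Megaparsec.Char (char, eol, printChar)

-- Assumption: plain String input and no custom error component.
type Parser = Parsec Void String

type Tag = (String, String)

data Segment = Tags [Tag] | URL String
  deriving (Eq, Show)

-- "#key:value" terminated by a line break.
tagP :: Parser Tag
tagP = char '#' *> ((,) <$> someTill printChar (char ':') <*> someTill printChar eol)
  <?> "Tag starting with #"

-- Any non-empty line of printable characters.
urlP :: Parser String
urlP = someTill printChar eol <?> "Some URL"

-- `some tagP` fails without consuming input on a URL line,
-- so the URL alternative gets its turn.
segmentsP :: Parser [Segment]
segmentsP = many ((Tags <$> some tagP) <|> (URL <$> urlP)) <* eof

-- Hypothetical helper: run the parser on an in-memory string.
parseSegments :: String -> Maybe [Segment]
parseSegments = either (const Nothing) Just . parse segmentsP "<input>"
```

With this shape, each run of consecutive tag lines becomes one Tags segment and each URL line becomes one URL segment; pairing them up is left to a later step.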
0
votes

@cocreature answered this for me on Twitter.

As leftaroundabout pointed out here, there are two separate mistakes in my code:

  1. The parser itself misuses <|>: M.many tagP succeeds even when it matches zero tags, so the URL alternative is never tried. It should instead parse the lines sequentially, first the (possibly empty) run of tags and then the URL.
  2. The invocation (parseFromFile) applies the parser only once, so it fails as soon as it reaches the second block.
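
The first point can be illustrated without Megaparsec at all. The following is a toy, hand-rolled parser type (not Megaparsec's implementation) with a left-biased <|>; the names Parser, char, segment and segment' exist only for this sketch. It mirrors the shape of the original parser and shows that many on the left of <|> always wins, even on input it cannot match.

```haskell
import Control.Applicative (Alternative (..))

-- A toy parser: consumes a prefix of the input or fails with Nothing.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing      -> Nothing
    Just (a, s') -> Just (f a, s')

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing      -> Nothing
    Just (f, s') -> case pa s' of
      Nothing       -> Nothing
      Just (a, s'') -> Just (f a, s'')

instance Alternative Parser where
  empty = Parser (const Nothing)
  -- Left-biased choice: the right side runs only if the left side fails.
  Parser p <|> Parser q = Parser $ \s -> maybe (q s) Just (p s)

-- Match one specific character.
char :: Char -> Parser Char
char c = Parser $ \s -> case s of
  (x : xs) | x == c -> Just (c, xs)
  _                 -> Nothing

data Segment = Tags String | Url String
  deriving (Eq, Show)

-- Same shape as the question: `many` never fails, so Url is unreachable.
segment :: Parser Segment
segment = (Tags <$> many (char '#')) <|> (Url <$> some (char 'h'))

-- With `some`, the left branch fails on non-tag input and Url is tried.
segment' :: Parser Segment
segment' = (Tags <$> some (char '#')) <|> (Url <$> some (char 'h'))
```

Running runParser segment "hh" yields an empty Tags result with all of the input left over, whereas runParser segment' "hh" falls through to the Url branch and consumes both characters.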

We can fix the parser and introduce grouping in one go:

-- liftA2 needs to be imported from Control.Applicative
parser :: M.Parser ([Tag], String)
parser = liftA2 (,) (M.many tagP) urlP

Afterwards, we just need to apply the change suggested by leftaroundabout:

...
res <- M.parseFromFile (M.many parser <* M.eof) fname

Running this leads to the desired result:

[([("foo","bar"),("faz","baz")],"https://example.com"),([("foo","beep")],"https://example.net")]
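
For reference, here is a self-contained sketch of the whole thing that parses an in-memory string instead of a file. It is written against what I believe is the current Megaparsec API (version 7 or later, where Text.Megaparsec.Text no longer exists), so the Parser alias and the names groupP and parseGroups are assumptions of this sketch rather than part of the original code.

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec, eof, errorBundlePretty, many, parse, someTill, (<?>))
import Text.Megaparsec.Char (char, eol, printChar)

-- Assumption: plain String input and no custom error component.
type Parser = Parsec Void String

type Tag = (String, String)

-- "#key:value" terminated by a line break.
tagP :: Parser Tag
tagP = char '#' *> ((,) <$> someTill printChar (char ':') <*> someTill printChar eol)
  <?> "Tag starting with #"

-- Any non-empty line of printable characters.
urlP :: Parser String
urlP = someTill printChar eol <?> "Some URL"

-- Zero or more tags followed by the URL they describe.
groupP :: Parser ([Tag], String)
groupP = (,) <$> many tagP <*> urlP

-- Hypothetical helper: parse a whole document, rendering errors as text.
parseGroups :: String -> Either String [([Tag], String)]
parseGroups = either (Left . errorBundlePretty) Right . parse (many groupP <* eof) "<input>"
```

Note that every line, including the last one, has to end with a line break, because both tagP and urlP use eol as their terminator.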