3 votes

I'm taking a course on compilers at my university. I've chosen to do the project in Haskell with Parsec. The lexer and parser are required to be separate: I'm using Parsec to convert a string into a list of tokens, which is then passed on to another Parsec parser that converts the list of tokens into an AST.

The problem is that the lexer is supposed to keep lexing even when it hits an error. To try to do this, I added a token representing an "unexpected token" to my Token data type, and I sprinkled my code with <|> unexpected to produce this token whenever an error occurs. This is a lot of boilerplate, and it's also difficult to reason about where these alternatives need to go.

My preferred solution would be to have Parsec do this automatically: whenever there's a ParseError, produce an unexpected token at that position and resume parsing one position later. How would I do this?

Here is a snippet of some of the code I have now: http://lpaste.net/8144414997276000256. For some reason I can still get a parse error, even though the Unexpected token should be catching the unhandled cases.


1 Answer

1 vote

It seems like you should be able to get away with a single extra unexpected term. I assume your top-level token parser looks something like this:

token' =  number
      <|> identifier
      <|> ...
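
The examples below also assume a Token data type with one constructor per kind of token, along these lines (the exact constructors are my guess, not taken from your code):

data Token
  = Number Int
  | Identifier String
  | Unexpected          -- placeholder for input the lexer can't handle
  deriving (Show, Eq)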

I'd probably have each token parser (number, identifier, etc.) manage its own trailing whitespace:

number :: Parser Token
number = Number . read <$> many1 digit <* spaces
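
As a quick sanity check in ghci (my own example; parseTest needs the Show instance on Token):

*Main> parseTest number "42   "
Number 42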

Why not add your extra unexpected term as a catch-all at the end of token'?

token' =  number
      <|> identifier
      <|> ...
      <|> unexpected'

Have it consume a single character. For better error messages, you could even include the character in the value. Then, when you use this to build the token list, you will get an Unexpected value for each character your lexer doesn't know what to do with.

unexpected' :: Parser Token
unexpected' = Unexpected <$ anyChar
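
If you do want the offending character in the value, a variant might look like this (this assumes you change the constructor to carry a Char, e.g. Unexpected Char):

-- assumes: data Token = ... | Unexpected Char
unexpected' :: Parser Token
unexpected' = Unexpected <$> anyChar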

Finally, the whole lexer is just many token'. In my tests, this deals fine with invalid characters in the middle of the input.

*Main> parse (many token') "<foo>" "1 2 abc ~ ~def"
Right [Number 1,Number 2,Identifier "abc",Unexpected,Unexpected,Unexpected,Identifier "def"]
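
Collected into one runnable file, the whole thing might look like this (the module scaffolding, imports and Token type are my additions; the parsers are the ones above):

module Main where

import Control.Applicative (liftA2)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumed token type; yours presumably has more constructors.
data Token = Number Int | Identifier String | Unexpected
  deriving (Show, Eq)

number :: Parser Token
number = Number . read <$> many1 digit <* spaces

identifier :: Parser Token
identifier = Identifier <$> liftA2 (:) letter (many1 alphaNum) <* spaces

-- Catch-all: swallow one character and mark it as unexpected.
unexpected' :: Parser Token
unexpected' = Unexpected <$ anyChar

token' :: Parser Token
token' = number <|> identifier <|> unexpected'

main :: IO ()
main = print (parse (many token') "<foo>" "1 2 abc ~ ~def")
-- Right [Number 1,Number 2,Identifier "abc",Unexpected,Unexpected,Unexpected,Identifier "def"]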

One thing to keep in mind is that Parsec does not backtrack by default. This means that if a parse fails partway through a token, it will not go back and fall through to unexpected'; you will just get an error. To enable backtracking, you have to use try on the parser that might fail after consuming input. For example, if identifier requires at least two characters:

identifier :: Parser Token
identifier = Identifier <$> liftA2 (:) letter (many1 alphaNum) <* spaces

Then it might fail part of the way through and not backtrack. But if you wrap it in a try, it should work:

token' =  number
      <|> try identifier
      <|> ...
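
As a hypothetical illustration (the input and traces here are mine, not from an actual test), take an input containing a one-letter identifier, which the two-character identifier above chokes on:

-- Without try: identifier consumes the 'a', then many1 alphaNum fails on the
-- space, so token' (and the whole lex) fails with a parse error.
--   parse (many token') "" "1 a 2"   -- Left <parse error>
--
-- With try identifier: the consumed 'a' is undone, the 'a' falls through to
-- unexpected', and lexing continues (the space also becomes Unexpected,
-- since unexpected' doesn't skip whitespace):
--   parse (many token') "" "1 a 2"   -- Right [Number 1,Unexpected,Unexpected,Number 2]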

The problem with try is that it can slow your code down if you aren't careful. However, if you don't mind the slowdown, you might be able to get away with just adding try everywhere and backtracking a lot!