6
votes

I'm new to Parsec (and to parsers in general), and I'm having some trouble with this parser I wrote:

list = char '(' *> many (spaces *> some letter) <* spaces <* char ')'

The idea is to parse lists in this format (I'm working up to s-expressions):

(firstElement secondElement thirdElement and so on)

I wrote this code to test it:

import Control.Applicative
import Text.ParserCombinators.Parsec hiding (many)

list = char '(' *> many (spaces *> some letter) <* spaces <* char ')'

test s = do
  putStrLn $ "Testing " ++ show s ++ ":"
  parseTest list s
  putStrLn ""

main = do
  test "()"
  test "(hello)"
  test "(hello world)"
  test "( hello world)"
  test "(hello world )"
  test "( )"

This is the output I get:

Testing "()":
[]

Testing "(hello)":
["hello"]

Testing "(hello world)":
["hello","world"]

Testing "( hello world)":
["hello","world"]

Testing "(hello world )":
parse error at (line 1, column 14):
unexpected ")"
expecting space or letter

Testing "( )":
parse error at (line 1, column 3):
unexpected ")"
expecting space or letter

As you can see, it fails when there is white space between the last element of the list, and the closing ). I don't understand why the white space isn't consumed by the spaces I put in just before <* char ')'. What silly mistake have I made?

3

3 Answers

12
votes

The problem is that the final spaces are consumed by the spaces in the argument to many,

list = char '(' *> many (spaces *> some letter) <* spaces <* char ')'
                     --  ^^^^^^ that one

and then the parser expects some letter but finds a closing parenthesis and thus fails.

To solve it, consume spaces only after tokens,

list = char '(' *> spaces *> many (some letter <* spaces) <* char ')'

that works as desired:

$ runghc lisplists.hs 
Testing "()":
[]

Testing "(hello)":
["hello"]

Testing "(hello world)":
["hello","world"]

Testing "( hello world)":
["hello","world"]

Testing "(hello world )":
["hello","world"]

Testing "( )":
[]
3
votes

The problem is that once the parser many (spaces *> some letter) sees a space it commits itself to parsing another item, since Parsec by default only looks ahead one character and doesn't backtrack.

The sledgehammer solution is to use try to enable backtracking, but problems like this are best avoided by simply parsing optional whitespace after each token instead, as seen in Daniel's answer.

0
votes

It's a bit tricky. Parsers by default are greedy. What it means in your case? When you try to parse (hello world ) you start from parsing (, then you are trying to match some spaces and identifier. So we do it. There are no spaces, but there is identifier. We are done. We try again with world. Now we have got _) remaining. You try parser (spaces *> some letter). It makes it greedy: so you match space and now you expect some letter, but you get ) instead. At this moment parser fails, but it already consumed space, so you are doomed. You can make this parser do backtracking by using try combinator: try (many (spaces *> some letter))