Parsing simple molecule names with Attoparsec

Question

I find it extremely difficult to learn how to use Attoparsec, because the documentation is really just an API reference and there are basically no tutorials around (except the one from FPComplete). If you know of other places where I can learn Attoparsec, that'd be great.

I have to parse simple molecule names, in the following format: NaCl, CO2, H2O, HCN, H2O2.
An element name is an uppercase letter optionally followed by a lowercase one (I'm not considering those elements with a symbol longer than 2 characters).
An element can be followed by a number (that would be the subscript in a formula).

New version (thanks to Mark's and Tarmil's suggestions), which compiles but does not parse:

module Chem
    where

import Data.Text (Text, pack)
import Control.Applicative ((<*>), (<$>))
import Data.Attoparsec.Text

data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]

parseString :: String -> Result Molecule
parseString = parse (many' parseElement) . pack

parseElement :: Parser Element
parseElement = do
    el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
    n  <- option 1 decimal
    return $ Element el n

pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)

Any suggestion is appreciated.

EDIT: I managed to get it running. Basically, parse returned a Partial continuation, and to finish the parsing it's necessary to feed that result an empty Text (not a bytestring, since this is Data.Attoparsec.Text). So the correct parseString would be:

parseString = flip feed empty . parse (many' parseElement) . pack

where empty is Data.Text.empty. However, since I don't need incremental parsing there is the useful function parseOnly, which does not wait for more input and returns an Either.
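
To make the difference concrete, here is a minimal sketch that runs the same parser three ways: with parse alone, with parse followed by feed, and with parseOnly. It reuses the definitions from the question; describe is a hypothetical helper written only for this illustration and is not part of attoparsec. It assumes a GHC recent enough that <$> and <*> are in the Prelude.

```haskell
module Main where

import Data.Attoparsec.Text
import qualified Data.Text as T

data Element = Element String Int deriving (Eq, Ord, Show)

-- Parse one character of the given class and wrap it in a single-char string.
pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)

parseElement :: Parser Element
parseElement = do
    el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
    n  <- option 1 decimal
    return $ Element el n

-- Hypothetical helper (not part of attoparsec): summarize a Result as a string.
describe :: Result [Element] -> String
describe (Fail _ _ err) = "Fail: " ++ err
describe (Partial _)    = "Partial: the parser wants more input"
describe (Done rest r)  = "Done " ++ show r ++ " (leftover " ++ show rest ++ ")"

main :: IO ()
main = do
    let input = T.pack "H2O"
    -- parse alone stops at the end of the chunk: the final "O" could still
    -- be followed by a lowercase letter, so a continuation is returned.
    putStrLn (describe (parse (many' parseElement) input))
    -- Feeding an empty Text signals that there is no more input.
    putStrLn (describe (feed (parse (many' parseElement) input) T.empty))
    -- parseOnly never waits for more input and returns an Either.
    print (parseOnly (many' parseElement) input)
```

Run as a program, the first line should report Partial, the second should report Done with [Element "H" 2,Element "O" 1] and an empty leftover, and the third should print Right [Element "H" 2,Element "O" 1].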

With that in mind, I rewrote the code like this (it works now):

module Chem
    where

import Data.Text (Text, pack)
import Control.Applicative ((<*>), (<$>))
import Data.Attoparsec.Text

data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]

parseString :: String -> Either String Molecule
parseString = parseOnly (many' parseElement) . pack

parseElement :: Parser Element
parseElement = do
    el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
    n  <- option 1 decimal
    return $ Element el n

pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)

Information is supposed to be subscript? I'm not really good at chemistry, but if there is no subscript, 1 is implied. So, what do you think about something like data Element = Element String Int (if there is no number you can use 1 in constructor), and then you can write type Molecule = [Element]. And your parser will be something like Parser Molecule... — Mark Karpov
@Mark: Yes, it would be 1. I didn't consider things this way, it seems very reasonable! — rubik
Check out the functions parseOnly and parseTest. Unless you are trying to parse something huge or streaming, parseOnly is the way to go. parseTest is good for checking how your parser will work, and it helps you detect the Partial thing if you do eventually want to use parse, which is almost never. parseWith is best for huge streaming things where you want to supply more data. — Michael Fox

Tarmil · Accepted Answer · 2014-10-25T08:55:28

You have two problems in the letters parsing part:

1. inClass is not a parser, it is a function that is meant to be passed to satisfy.
2. <*> has type Parser (a -> b) -> Parser a -> Parser b, so the parser on the left should return a function. Typically, it is used like this:
```
pf <$> p1 <*> p2 <*> ... <*> pn
```
where pf is a function with n arguments.

So here you probably want something like this:

-- parse a character in the given class, and transform it to a single-char string
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)

-- ...
    el <- ((++) <$> pClass "A-Z" <*> pClass "a-z") <|> pClass "A-Z"
-- ...

I think this could be improved by using option, which avoids duplicating the A-Z parser:

    el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
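
For reference, the two variants can be compared side by side in a small self-contained sketch. The names symbolAlt, symbolOpt, element and molecule are hypothetical, chosen only for this illustration. The use of many1 and endOfInput is an addition that does not appear in the code above: it assumes you want an empty or partly unparseable string to be rejected rather than silently accepted.

```haskell
module Main where

import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import Data.Text (pack)

data Element = Element String Int deriving (Eq, Ord, Show)

-- Parse one character of the given class and wrap it in a single-char string.
pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)

-- Variant 1: try "uppercase then lowercase", fall back to uppercase alone.
symbolAlt :: Parser String
symbolAlt = ((++) <$> pClass "A-Z" <*> pClass "a-z") <|> pClass "A-Z"

-- Variant 2: the lowercase letter is optional and defaults to "".
symbolOpt :: Parser String
symbolOpt = (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")

-- Element is a two-argument function, so it fits the pf <$> p1 <*> p2 pattern.
element :: Parser String -> Parser Element
element symbol = Element <$> symbol <*> option 1 decimal

-- Assumption: require at least one element and no trailing garbage.
molecule :: Parser String -> Parser [Element]
molecule symbol = many1 (element symbol) <* endOfInput

main :: IO ()
main = do
    print (parseOnly (molecule symbolAlt) (pack "NaCl"))
    -- Right [Element "Na" 1,Element "Cl" 1]
    print (parseOnly (molecule symbolOpt) (pack "H2O2"))
    -- Right [Element "H" 2,Element "O" 2]
```

Both variants accept the same inputs here; the option version simply avoids running the A-Z parser twice when no lowercase letter follows.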
