I find it extremely difficult to learn how to use Attoparsec, because the documentation is really just an API reference and there are basically no tutorials around (except the one from FPComplete). If you know of other places where I can learn Attoparsec, that would be great.
I have to parse simple molecule names in the following format: NaCl, CO2, H2O, HCN, H2O2.
An element name is an uppercase letter optionally followed by a lowercase one (I'm not considering those elements with a symbol longer than 2 characters).
An element can be followed by a number (that would be the subscript in a formula).
New version (thanks to Mark's and Tarmil's suggestions), which compiles but does not parse:
    module Chem
    where

    import Data.Text (Text, pack)
    import Control.Applicative ((<*>), (<$>))
    import Data.Attoparsec.Text

    data Element = Element String Int deriving (Eq, Ord, Show)
    type Molecule = [Element]

    parseString :: String -> Result Molecule
    parseString = parse (many' parseElement) . pack

    parseElement :: Parser Element
    parseElement = do
        el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
        n  <- option 1 decimal
        return $ Element el n

    pClass :: String -> Parser String
    pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
Any suggestion is appreciated.
EDIT: I managed to get it running. Basically, a Partial continuation was returned, and to finish the parsing it's necessary to feed the parser an empty input (an empty Text here, since the parser is built on Data.Attoparsec.Text). So the correct parseString would be:

    parseString = flip feed empty . parse (many' parseElement) . pack

where empty is Data.Text.empty. However, since I don't need incremental parsing, there is the useful function parseOnly, which does not wait for more input and returns an Either.
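To make the difference concrete, here is a minimal sketch of the three behaviours on a toy parser. The names `describe`, `digits`, `incremental`, `finished` and `oneShot` are made up for this illustration and are not part of Attoparsec or of the code in the question; only `parse`, `feed`, `parseOnly`, `many'` and `digit` come from the library.

```haskell
import Data.Attoparsec.Text
import qualified Data.Text as T

-- Hypothetical helper: render a Result as a String so the three cases
-- are easy to tell apart.
describe :: Show a => Result a -> String
describe (Done rest x)  = "Done " ++ show x ++ ", leftover " ++ show rest
describe (Partial _)    = "Partial"
describe (Fail _ _ err) = "Fail: " ++ err

-- Toy parser standing in for many' parseElement.
digits :: Parser String
digits = many' digit

-- parse cannot know whether more digits will follow, so it asks for more input.
incremental :: String
incremental = describe (parse digits (T.pack "123"))

-- Feeding empty input signals end of input, so the parser can finish.
finished :: String
finished = describe (feed (parse digits (T.pack "123")) T.empty)

-- parseOnly treats the given Text as the complete input.
oneShot :: Either String String
oneShot = parseOnly digits (T.pack "123")
```

Loaded in GHCi, `incremental` should evaluate to "Partial", while `finished` and `oneShot` both carry the parsed "123".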
With that in mind, I rewrote the code like this (it works now):
    module Chem
    where

    import Data.Text (Text, pack)
    import Control.Applicative ((<*>), (<$>))
    import Data.Attoparsec.Text

    data Element = Element String Int deriving (Eq, Ord, Show)
    type Molecule = [Element]

    parseString :: String -> Either String Molecule
    parseString = parseOnly (many' parseElement) . pack

    parseElement :: Parser Element
    parseElement = do
        el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
        n  <- option 1 decimal
        return $ Element el n

    pClass :: String -> Parser String
    pClass cls = (\c -> [c]) <$> satisfy (inClass cls)
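As a quick check, the parser can be restated without the module header so that it loads on its own in GHCi. One thing worth knowing: because `many'` also succeeds with zero matches, `parseString` returns an empty list for input that does not start with an element. The variant `parseStrict` below is my own assumption, not part of the code above; it adds `endOfInput` so that unconsumed input is reported as an error.

```haskell
import Data.Attoparsec.Text
import Data.Text (pack)

data Element = Element String Int deriving (Eq, Ord, Show)
type Molecule = [Element]

-- Same parser as in the post: an uppercase letter, an optional lowercase
-- letter, and an optional count that defaults to 1.
parseElement :: Parser Element
parseElement = do
    el <- (++) <$> pClass "A-Z" <*> option "" (pClass "a-z")
    n  <- option 1 decimal
    return $ Element el n

pClass :: String -> Parser String
pClass cls = (\c -> [c]) <$> satisfy (inClass cls)

parseString :: String -> Either String Molecule
parseString = parseOnly (many' parseElement) . pack

-- Hypothetical stricter variant: fails unless the whole input is consumed.
parseStrict :: String -> Either String Molecule
parseStrict = parseOnly (many' parseElement <* endOfInput) . pack
```

With these definitions, `parseString "H2O2"` gives `Right [Element "H" 2,Element "O" 2]` and `parseString "h2o"` gives `Right []`, whereas `parseStrict "h2o"` gives a `Left` with an error message.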
[...] is supposed to be subscript? I'm not really good at chemistry, but if there is no subscript, 1 is implied. So, what do you think about something like data Element = Element String Int (if there is no number you can use 1 in the constructor), and then you can write type Molecule = [Element]. And your parser will be something like Parser Molecule... – Mark Karpov

[...] 1. I didn't consider things this way, it seems very reasonable! – rubik

[...] parseOnly and parseTest. Unless you are trying to parse something huge or streaming, parseOnly is the way to go. parseTest is good for checking how your parser will work and helps you detect the Partial thing if you do want to eventually use parse, which is almost never. parseWith is best for huge streaming things where you want to supply more data. – Michael Fox