Works for me:
Prelude Text.ParserCombinators.Parsec> let authorName = do { name <- many1 (noneOf ",:\r\n\8212\8213"); many (oneOf ",:-\8212\8213"); }
Prelude Text.ParserCombinators.Parsec> parse authorName "" "my Name,\8212::-:\8213,"
Right ",\8212::-:\8213,"
How did you try?
The above was using plain String, which works without problems because a Char is a full Unicode code point. It's not as nice with other types of input stream. Text will probably also work well for this example, since I think the dashes are encoded as a single code unit there. For ByteString, however, things are more complicated. If you're using plain Data.ByteString.Char8 (strict or lazy, doesn't matter), the Chars get truncated on packing: only the least significant 8 bits are retained, so '\8212' becomes 20 and '\8213' becomes 21. If the input stream is constructed the same way, that still kind of works, except that every Char congruent to 20 or 21 modulo 256 is mapped to the same byte as one of the dashes.
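To make the truncation concrete, here is a small sketch using only the bytestring package. The names truncatedDashes and collisions are made up for this illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Bytes produced by packing the two dashes through the Char8 interface.
-- Only the low 8 bits of each code point survive.
truncatedDashes :: [Int]
truncatedDashes = map fromIntegral (B.unpack (BC.pack "\8212\8213"))

-- '\20' and '\276' (276 = 256 + 20) pack to the same byte as '\8212',
-- so after packing they can no longer be told apart from the dash.
collisions :: Bool
collisions = BC.pack "\20\276" == BC.pack "\8212\8212"
```

In GHCi, truncatedDashes evaluates to [20,21] and collisions to True.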
However, it is likely that the input stream is UTF-8 encoded. In that case the dashes are encoded as three bytes each, "\226\128\148" and "\226\128\149" respectively, which doesn't match what you get by truncating. Parsing UTF-8 encoded text with ByteString and parsec is a bit more involved, because the units of which the parse result is composed are not single bytes but sequences of bytes, each 1 to 4 bytes long.
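The byte sequences quoted above can be checked with a hand-written encoder. This is only a sketch for illustration: utf8Bytes is a hypothetical helper, not a library function, and it does not reject surrogate code points the way a real encoder would.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)

-- Encode one code point as its UTF-8 byte sequence (1 to 4 bytes).
utf8Bytes :: Char -> [Int]
utf8Bytes c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18)
                  , cont (n `shiftR` 12)
                  , cont (n `shiftR` 6)
                  , cont n ]
  where
    n = ord c
    -- Continuation byte: bits 10xxxxxx carrying the low six bits of x.
    cont x = 0x80 .|. (x .&. 0x3F)
```

Here utf8Bytes '\8212' gives [226,128,148] and utf8Bytes '\8213' gives [226,128,149], while a plain ASCII character such as ',' stays a single byte.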
To use noneOf, you need an
instance Text.Parsec.Prim.Stream ByteString m Char
which does the right thing. The instance provided in Text.Parsec.ByteString[.Lazy] doesn't: it uses the Data.ByteString[.Lazy].Char8 interface. Depending on how the input was encoded, an em dash would therefore either become a single '\20', which doesn't match '\8212', or produce three Chars, '\226', '\128' and '\148', in three successive calls to uncons, none of which matches '\8212' either.
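One way to sidestep the problem, rather than writing such an instance, is to decode the bytes to Text first and run the parser over that. The sketch below assumes the input really is UTF-8 and that the text package and Text.Parsec.Text are available. parseUtf8 is a name invented here, and authorName is the parser from the session at the top, which returns the run of separators.

```haskell
import qualified Data.ByteString as B
import Data.Text.Encoding (decodeUtf8')
import Text.Parsec (many, many1, noneOf, oneOf, parse)
import Text.Parsec.Text (Parser)

-- Same parser as in the GHCi session, now over Text instead of String.
authorName :: Parser String
authorName = do
  _ <- many1 (noneOf ",:\r\n\8212\8213")
  many (oneOf ",:-\8212\8213")

-- Decode the raw bytes as UTF-8, then parse the resulting Text.
-- Both decoding errors and parse errors are reported as a String.
parseUtf8 :: B.ByteString -> Either String String
parseUtf8 bytes =
  case decodeUtf8' bytes of
    Left err  -> Left (show err)
    Right txt -> either (Left . show) Right (parse authorName "" txt)
```

Because the decoding happens before parsec sees the input, each dash reaches noneOf and oneOf as a single Char. The price is that the whole input gets decoded up front.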