Parsec match single unicode character

Question

I'm trying to create a parser (with Parsec) that parses tokens delimited by newlines, commas, semicolons and Unicode dashes (ndash and mdash):

authorParser = do
    name <- many1 (noneOf [',', ':', '\r', '\n', '\8212', '\8213'])
    many (char ',' <|> char ':' <|> char '-' <|> char '\8212' <|> char '\8213')

But the ndash-mdash (\8212, \8213) part never 'succeeds' and I'm getting invalid parse results.

How do I specify Unicode dashes with the char parser?

P.S. I've tried (chr 8212) and (chr 8213) too. It doesn't help.

ADDITION: It is better to use Data.Text. Switching from the ByteString madness to Data.Text saved me a lot of time and 'source space' :)

I think the encoding problems should be a new question, not enough space to treat that in comments. — Daniel Fischer

Daniel Fischer · Accepted Answer · 2011-12-19T18:22:50

Works for me:

Prelude Text.ParserCombinators.Parsec> let authorName = do { name <- many1 (noneOf ",:\r\n\8212\8213"); many (oneOf ",:-\8212\8213"); }
Prelude Text.ParserCombinators.Parsec> parse authorName "" "my Name,\8212::-:\8213,"
Right ",\8212::-:\8213,"

How did you try?

The above was using plain String, which works without problems because a Char is a full Unicode code point. It's not as nice with other types of input stream. Text will probably also work well for this example; I think that the dashes are encoded as a single code unit there. For ByteString, however, things are more complicated. If you're using plain Data.ByteString.Char8 (strict or lazy, doesn't matter), the Chars get truncated on packing: only the least significant 8 bits are retained, so '\8212' becomes 20 and '\8213' becomes 21. If the input stream is constructed the same way, that still kind of works, except that every Char congruent to 20 or 21 modulo 256 is mapped to the same byte as one of the dashes.
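The truncation described above can be checked directly. This is only a small sketch: the names truncated and collides are made up for illustration, and U+2114 is just one arbitrary code point that happens to be congruent to 20 modulo 256.

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.Char (ord)

-- Char8.pack keeps only the low 8 bits of each Char, so the two
-- dashes come back out as the control characters 20 and 21.
truncated :: [Int]
truncated = map ord (BC.unpack (BC.pack "\8212\8213"))

-- '\8468' (U+2114) is 33 * 256 + 20, so after packing it is
-- indistinguishable from '\8212'.
collides :: Bool
collides = BC.pack "\8468" == BC.pack "\8212"

main :: IO ()
main = do
  print truncated  -- [20,21]
  print collides   -- True
```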

However, it is likely that the input stream is UTF-8 encoded. In that case the dashes are encoded as three bytes each, "\226\128\148" and "\226\128\149" respectively, which doesn't match what you get by truncating. Parsing UTF-8 encoded text with ByteString and Parsec is a bit more involved, because the units of which the parse result is composed are not single bytes but sequences of bytes, 1-4 in length each.
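To see those byte sequences, one can encode each dash with Data.Text.Encoding.encodeUtf8 and look at the raw bytes. The helper name utf8Bytes is invented for this sketch.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- UTF-8 bytes of a single Char, as a list of Word8 values.
utf8Bytes :: Char -> [Word8]
utf8Bytes = B.unpack . TE.encodeUtf8 . T.singleton

main :: IO ()
main = do
  print (utf8Bytes '\8212')  -- [226,128,148]
  print (utf8Bytes '\8213')  -- [226,128,149]
  print (utf8Bytes ',')      -- [44], ASCII stays one byte
```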

To use noneOf, you need an

instance Text.Parsec.Prim.Stream ByteString m Char

which does the right thing. The instance provided in Text.Parsec.ByteString[.Lazy] doesn't: it uses the Data.ByteString[.Lazy].Char8 interface. Depending on how the input was encoded, the dash '\8212' (strictly speaking the em dash, U+2014) would therefore either arrive as a single '\20', which doesn't match '\8212', or as three Chars, '\226', '\128' and '\148', in three successive calls to uncons, none of which matches '\8212' either.
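One way to sidestep the missing instance, in line with the asker's note about Data.Text, is to decode the UTF-8 bytes to Text first and then use the Stream instance from Text.Parsec.Text. This is a sketch of that alternative rather than a custom ByteString instance; the function name parseAuthorUtf8 and the split into authorName and delimiters are assumptions made for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Text.Parsec
import Text.Parsec.Text (Parser)

-- Same character classes as in the GHCi session, but over Text input.
authorName :: Parser String
authorName = many1 (noneOf ",:\r\n\8212\8213")

delimiters :: Parser String
delimiters = many (oneOf ",:-\8212\8213")

-- Hypothetical helper: decode UTF-8 bytes, then parse a name and the
-- delimiters that follow it. Decoding or parse errors end up in Left.
parseAuthorUtf8 :: B.ByteString -> Either String (String, String)
parseAuthorUtf8 bytes =
  case TE.decodeUtf8' bytes of
    Left err  -> Left (show err)
    Right txt -> either (Left . show) Right (parse p "" txt)
  where
    p = (,) <$> authorName <*> delimiters

main :: IO ()
main = print (parseAuthorUtf8 (TE.encodeUtf8 (T.pack "my Name,\8212::-:\8213,")))
```

Because the parser sees whole code points again, '\8212' and '\8213' match exactly as they do with String input.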
