I have to parse some files and convert them to some predefined datatypes.

Haskell seems to provide two packages for that:

  1. attoparsec
  2. parsec

What is the difference between the two of them and which one is better suited for parsing a text file according to some rules?

They're roughly equivalent. attoparsec is faster, but parsec is likely installed by default, and may therefore be more convenient. – sanityinc
The documentation for module Data.Attoparsec.ByteString has a comparison between Parsec and Attoparsec: hackage.haskell.org/package/attoparsec-0.10.4.0/docs/… – danidiaz
I'd just like to mention that Haskell provides many more than two packages for parsing, and you're missing several very good ones, in particular uu-parsinglib and polyparse. – John L
@JohnL Thanks, didn't know that. – Sibi
There is also the parsec-fork megaparsec now: mail.haskell.org/pipermail/haskell-cafe/2015-September/… – unhammer

1 Answer

Parsec

Parsec is good for "user-facing" parsers: things where you have a bounded amount of input but error messages matter. It's not terribly fast, but if you have small inputs this shouldn't matter. For example, I would choose Parsec for virtually any programming-language tool, since in absolute terms even the largest source files are not that big, but error messages really matter.

Parsec can work on different input types, which means you can use it with a standard String or with a stream of tokens from an external lexer of some sort. Since it can use String, it handles Unicode perfectly well for you; built-in character parsers like letter and alphaNum are Unicode-aware (digit, by contrast, matches only the ASCII digits 0-9).
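To make that concrete, here is a minimal sketch of a String-based Parsec parser. The Setting type and the "name = number" line format are made up for illustration; only the Text.Parsec functions are real. The <?> labels are what end up in the error message when the input doesn't match.

```haskell
import Text.Parsec (ParseError, char, digit, eof, letter, many1, parse, spaces, (<?>))
import Text.Parsec.String (Parser)

-- Hypothetical target type for a line such as "port = 8080".
data Setting = Setting String Int
  deriving (Eq, Show)

-- Parses "<letters> = <digits>" with optional whitespace around '='.
setting :: Parser Setting
setting = do
  key <- many1 letter <?> "setting name"    -- letter accepts Unicode letters
  spaces
  _ <- char '='
  spaces
  value <- many1 digit <?> "numeric value"  -- digit accepts ASCII digits only
  eof
  return (Setting key (read value))

-- "<input>" is just the source name shown in error messages.
parseSetting :: String -> Either ParseError Setting
parseSetting = parse setting "<input>"
```

Calling parseSetting "port = abc" in GHCi gives a Left whose message points at the line and column of the offending character and names the expected "numeric value", which is the kind of feedback you want for hand-written input.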

Parsec also comes with a monad transformer, ParsecT, which means you can layer it in a monad stack. This could be useful if you want to keep track of additional state during your parse, for example. You could also go for more exotic effects like non-deterministic parsing: the usual magic of monad transformers.
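A small sketch of what layering looks like, assuming you want to count words while parsing them. The names Counting, word, wordsP and runWords are hypothetical; ParsecT, runParserT and the State monad from mtl are the real pieces. Note that effects in the underlying monad are not rolled back if the parser backtracks, so this is only safe here because word never fails after it has bumped the counter.

```haskell
import Control.Monad.State (State, modify, runState)
import Control.Monad.Trans.Class (lift)
import Text.Parsec (ParseError, ParsecT, eof, letter, many, many1, runParserT, spaces)

-- A parser over String with no Parsec user state, layered on State Int.
type Counting = ParsecT String () (State Int)

-- Parses one word and increments the counter in the underlying monad.
word :: Counting String
word = do
  w <- many1 letter
  lift (modify (+ 1))
  spaces
  return w

-- Parses whitespace-separated words up to the end of input.
wordsP :: Counting [String]
wordsP = spaces *> many word <* eof

-- Runs the parser, then unwraps the State layer starting from 0.
runWords :: String -> (Either ParseError [String], Int)
runWords input = runState (runParserT wordsP () "<input>" input) 0
```

For simple cases like this one, Parsec's built-in user state (the u parameter) would do the same job; the transformer becomes interesting when the effect lives in some other monad you already have.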

Attoparsec

Attoparsec is much faster than Parsec. You should use it when you expect to get large amounts of input or performance really matters. It's great for things like networking code (parsing packet structure), parsing large amounts of raw data or working with binary file formats.

Attoparsec can work with ByteStrings, which are binary data. This makes it a good choice for implementing things like binary file formats. However, since this is for binary data, it does not handle things like text encoding; for that, you should use the attoparsec module for Text, Data.Attoparsec.Text.
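As an illustration of the binary case, here is a sketch of a ByteString parser for an invented record layout: a magic byte 0xCA, a 16-bit big-endian length, then that many payload bytes. The format, the Record type and the helper word16be are assumptions for the example; the Data.Attoparsec.ByteString calls are the library's own.

```haskell
import qualified Data.Attoparsec.ByteString as A
import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as B
import Data.Word (Word16)

-- Hypothetical record: just the payload bytes.
newtype Record = Record B.ByteString
  deriving (Eq, Show)

-- Reads two bytes and combines them as a big-endian 16-bit number.
word16be :: A.Parser Word16
word16be = do
  hi <- A.anyWord8
  lo <- A.anyWord8
  return ((fromIntegral hi `shiftL` 8) .|. fromIntegral lo)

-- Magic byte, length prefix, then exactly that many payload bytes.
record :: A.Parser Record
record = do
  _ <- A.word8 0xCA
  len <- word16be
  Record <$> A.take (fromIntegral len)

-- parseOnly treats the given input as complete (no incremental feeding).
parseRecord :: B.ByteString -> Either String Record
parseRecord = A.parseOnly (record <* A.endOfInput)
```

Note the error type: a plain String, compared with Parsec's structured ParseError carrying a source position.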

Attoparsec supports incremental parsing, which Parsec does not. This is very important for certain applications like networking code, but doesn't matter for others.
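Roughly, incremental parsing means you can hand the parser whatever bytes have arrived so far and get back either a result or a continuation waiting for more. A minimal sketch, assuming a made-up "LEN <number>\n" message that arrives split across chunks (lenLine and parseChunks are hypothetical names; parse, feed and the Result constructors come from attoparsec):

```haskell
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B

-- Hypothetical line-based message: "LEN <decimal>\n".
lenLine :: A.Parser Int
lenLine = A.string (B.pack "LEN ") *> A.decimal <* A.char '\n'

-- Starts the parser on the first chunk and feeds it the remaining ones.
-- The result is Done, Fail, or Partial (still waiting for more input).
parseChunks :: [B.ByteString] -> A.Result Int
parseChunks []       = A.parse lenLine B.empty
parseChunks (c : cs) = foldl A.feed (A.parse lenLine c) cs
```

With the chunks "LE", "N 4" and "2\n", the first two calls return Partial and the third completes the parse. In real networking code the chunks would come from a socket read loop rather than a list; feeding an empty ByteString to a Partial result tells attoparsec that no more input is coming.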

Attoparsec has worse error messages than Parsec and sacrifices some high-level features for performance. It's specialized to Text or ByteString, so you can't use it with tokens from a custom lexer. It also isn't a monad transformer.

Which One?

Ultimately, Parsec and Attoparsec cater to very different niches. The high-level difference is performance: if you need it, choose Attoparsec; if you don't, just go with Parsec.

My usual heuristic is choosing Parsec for programming languages, configuration file formats and user input as well as almost anything I would otherwise do with a regex. These are things usually produced by hand, so the parsers do not need to scale but they do need to report errors well.

On the other hand, I would choose Attoparsec for things like implementing network protocols, dealing with binary data and file formats, or reading in large amounts of automatically generated data. These are cases where you're dealing with time constraints or large amounts of data that are usually not written directly by a human.

As you see, the choice is actually often pretty simple: the use cases don't overlap very much. Chances are, it'll be pretty clear which one to use for any given application.