6
votes

I'm writing a lexer (with re2c) and a parser (with Lemon) for a slightly convoluted data format: CSV-like, but with specific string types at specific places (alphanumeric chars only; alphanumeric chars and minus signs; any char except quotes and commas, but with balanced braces; etc.), strings inside braces, and strings that look like function calls, with opening and closing braces that can contain parameters.

My first shot at it was a lexer with many states, each state catering to a specific string format. But after many unhelpful "unexpected input" messages from the lexer (which got very big) I realized that maybe it was trying to do the work of the parser. I scrapped my first try and went with a lexer with only one state, many character tokens, and a parser that combines the tokens into the different string types. This works better: I get more helpful syntax errors from the parser when something is off, but it still doesn't feel quite right. I am thinking of adding one or two states to the lexer, but initiating the states from the parser, which has a much better "overview" of which string type is required in a given instance. Overall I feel a bit stupid :(
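
Roughly, what I have in mind is something like this (untested sketch, all the names are invented):

    %include {
        /* A lexer "mode" shared between the Lemon parser and the re2c
           lexer: the parser switches it, the lexer checks it before
           scanning the next token. */
        enum lex_mode { MODE_DEFAULT, MODE_BRACED };
        enum lex_mode lex_mode = MODE_DEFAULT;
    }

    /* Lemon has no mid-rule actions, so a helper nonterminal carries
       the switch right after the opening brace is seen: */
    open_brace    ::= LBRACE.                       { lex_mode = MODE_BRACED;  }
    braced_string ::= open_brace brace_body RBRACE. { lex_mode = MODE_DEFAULT; }

I know this kind of feedback is fragile, because an LALR parser may already have fetched the next token as lookahead before the action runs, which is part of why I'm unsure whether this is a sane direction at all.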

I have no formal CS background and shy away a bit from the math-heavy theory. But maybe there is a tutorial or book somewhere that explains what a lexer should (and should not) do and which part of the work the parser should do: how to construct good token patterns, when to use lexer states, when and how to use recursive rules (with an LALR parser), how to avoid ambiguous rules. A pragmatic cookbook that teaches the basics. The "Lex and YACC primer/HOWTO" was nice, but not enough. Since I just want to parse a data format, books on compiler building (like the red dragon book) look a bit oversized to me.

Or maybe someone can give me some plain rules here.


2 Answers

7
votes

What you should really do is write a grammar for your language. Once you have that, the boundary is easy:

  • The lexer is responsible for taking your input and telling you which terminal you have.
  • The parser is responsible for matching a series of terminals and nonterminals to a production rule, repeatedly, until you either have an Abstract Syntax Tree (AST) or a parse failure.

The lexer is not responsible for input validation except insofar as it rejects impossible characters and other very basic checks; the parser does all of that.
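
For example, a grammar for the format described in the question might start out something like this (Lemon-style notation; all token and rule names are invented, just to show the shape):

    file    ::= file NEWLINE record.
    file    ::= record.
    record  ::= record COMMA field.
    record  ::= field.
    field   ::= WORD.                          /* alphanumeric chars only      */
    field   ::= NAME.                          /* alphanumerics and '-'        */
    field   ::= LBRACE braced RBRACE.          /* string inside braces         */
    field   ::= NAME LBRACE arglist RBRACE.    /* "function call" with params  */
    arglist ::= arglist COMMA field.
    arglist ::= field.
    braced  ::= .                              /* empty                        */
    braced  ::= braced TEXT.                   /* text without braces/quotes   */
    braced  ::= braced LBRACE braced RBRACE.   /* balanced nested braces       */

Once the productions are written down, the terminals on the right-hand side (WORD, NAME, LBRACE, TEXT, ...) are exactly what the lexer has to produce, and everything else is the parser's job.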

Take a look at https://www.cs.rochester.edu/u/nelson/courses/csc_173/grammars/parsing.html . It's an intro CS course page on parsing.

5
votes

A good litmus test for deciding if something should be done by a parser or lexer is to ask yourself a question:

Does the syntax have any recursive, nested, self-similar elements?
(e.g. nested parentheses, braces, tags, subexpressions, subsentences etc.).

If not, plain regular expressions are enough and it can be done by the lexer.
If yes, it should be analysed by a parser, because it's at the very least a context-free grammar.
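
Concretely, for the format in the question (names invented): a field made of "alphanumeric chars and minus signs" has no nesting, so a single lexer pattern covers it, while a string that must keep its braces balanced does nest and therefore needs a recursive parser rule. Roughly:

    /* re2c-style lexer rule: flat, no nesting, a regular expression is enough */
    [a-zA-Z0-9-]+    { return TK_NAME; }

    /* Lemon-style parser rules: balanced braces are recursive,
       so they belong in the grammar, not in the lexer */
    string  ::= LBRACE content RBRACE.
    content ::= content string.        /* a braced string may contain braced strings */
    content ::= content TEXT.
    content ::= .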

The lexer is generally for finding the "words" of your language and classifying them (is it a noun? a verb? an adjective? etc.).
The parser is for finding proper "sentences": structuring them and determining whether they are valid sentences in the given language.