3
votes

I have recently finished my first parser/compiler for a special-purpose programming language. I had some freedom in the specification of this language, and in places I tweaked the spec to make it easier to parse, which has resulted in a few (so far minor) points for improvement in the language itself. Maybe I'll get to fix those in some future v2; for now I'm still digesting the things I've learned from this process, and I have a few questions about it. I never took a formal course on compiler design, so I want to verify that my understanding is in line with the state of the art in this field.

  • Most articles and textbooks divide the parsing process into two stages: tokenizing and lexical analysis. It is then usually mentioned that in most real-world implementations the two are somewhat intertwined, to make implementation easier. My own experience has been the opposite: mixing them makes things harder, and I even added a third pass. In my first pass I do minimalistic tokenizing, by which I mean that I define 'token' in its simplest form. I experimented with approaches in which, for example, a constructor as a whole was a single 'token', but I moved away from that and now only call 'tokens' the most basic building blocks of the program. Then I do a lexical analysis pass in which I build an AST. My third pass is (what I call) structural analysis, in which I e.g. do type checking and check whether the number of arguments passed to a function matches that function's signature. Would this normally not be considered part of 'parsing', or would it fall under 'lexical analysis'? And why does the literature mostly propose two-pass parsing approaches? (Arguably I have a fourth pass, code generation, but that's largely a separate process.) I've put a small sketch of this setup after this list.

  • My lexer is (I think) recursive-descent: I have routines that match token streams or return false when they can't, and then try to match another 'rule', until all tokens are processed. These routines look ahead at most 2 tokens; as I understand it, this means I have an 'LL(2) grammar' (the sketch after this list shows what I mean by the lookahead). My question is: why does this number matter? Are 'LL(1) grammars' somehow 'better' than those with a higher number? Also, why is backtracking undesirable? (Or isn't it, and am I just misunderstanding?)

  • Why is the categorization of grammars such a big deal? All texts start with a (usually quite extensive) explanation of it, but as I understand it, you can't really change the properties of a grammar without changing the way the language you're designing will work. Now I understand that if you want to parse an existing language, classifying its structure enables you to prove the complexity class of the resulting parser; but when designing a new language, the most important thing (imo) is expressiveness or some other kind of fitness for the purpose at hand, and the performance of its parser is of secondary importance. While I realize that this last part is debatable, are there practical reasons, when designing a new language, to carefully watch the type of the grammar?
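
For concreteness, here is a minimal sketch of the layout I mean (Python, with a made-up toy grammar; this is not my actual code or language, just the shape of the three passes and the two-token lookahead):

    # Toy sketch: flat tokenizing -> LL(2) recursive descent -> structural checks.
    import re

    TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")

    def tokenize(src):
        """Pass 1: split the source into flat (kind, text) tokens."""
        tokens = []
        for num, name, sym in TOKEN_RE.findall(src):
            if num:
                tokens.append(("NUM", num))
            elif name:
                tokens.append(("NAME", name))
            elif sym.strip():
                tokens.append(("SYM", sym))
        return tokens

    class Parser:
        """Pass 2: build an AST by recursive descent, peeking at most 2 tokens ahead."""
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def peek(self, offset=0):
            i = self.pos + offset
            return self.tokens[i] if i < len(self.tokens) else ("EOF", "")

        def expect(self, kind, text=None):
            k, t = self.peek()
            if k != kind or (text is not None and t != text):
                raise SyntaxError(f"expected {text or kind}, got {t!r}")
            self.pos += 1
            return t

        def statement(self):
            # The LL(2) part: 'NAME =' starts an assignment, anything else
            # is an expression, so two tokens of lookahead pick the rule.
            if self.peek()[0] == "NAME" and self.peek(1) == ("SYM", "="):
                name = self.expect("NAME")
                self.expect("SYM", "=")
                return ("assign", name, self.expression())
            return ("expr", self.expression())

        def expression(self):
            kind, text = self.peek()
            if kind == "NAME" and self.peek(1) == ("SYM", "("):  # function call
                name = self.expect("NAME")
                self.expect("SYM", "(")
                args = []
                while self.peek() != ("SYM", ")") and self.peek()[0] != "EOF":
                    args.append(self.expression())
                    if self.peek() == ("SYM", ","):
                        self.pos += 1
                self.expect("SYM", ")")
                return ("call", name, args)
            self.pos += 1
            return ("num", int(text)) if kind == "NUM" else ("var", text)

    SIGNATURES = {"max": 2}  # made-up signature table

    def check(node):
        """Pass 3: structural analysis, e.g. arity checks against signatures."""
        if node[0] == "call":
            _, name, args = node
            if name in SIGNATURES and len(args) != SIGNATURES[name]:
                raise TypeError(f"{name} expects {SIGNATURES[name]} arguments")
            for arg in args:
                check(arg)
        elif node[0] in ("assign", "expr"):
            check(node[-1])

    ast = Parser(tokenize("x = max(1, 2)")).statement()
    check(ast)
    print(ast)  # ('assign', 'x', ('call', 'max', [('num', 1), ('num', 2)]))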


1 Answer

4
votes
  1. I think you are confusing the terminology a little. Parsing only deals with going from source to AST, and it is divided into "stages", not "passes". Using the word "pass" implies that they are performed sequentially, which is in general not true.

    In practice, lexical analysis (i.e. tokenization) and actual parsing are intertwined because of syntactic peculiarities in the target language: the syntax of many real languages is better described with "context sensitive" tokens. Think of languages with string interpolation, or with user-definable operators. (The first sketch after this list shows the interpolation case.)

  2. The parse table of an LL(k) grammar grows exponentially with k in the worst case; I suppose that's what got them a bad name (for k > 1). If you are manually implementing a recursive descent parser, the lookahead doesn't really matter. Backtracking is still bad, however, since it can cause exponential runtime (the second sketch after this list shows how).

  3. You can change a grammar without affecting the target language. Ensuring that a grammar falls within a certain class allows you to use existing tools (like LALR parser generators) and gives you some information about the runtime characteristics of the parser. I do agree that language design shouldn't be guided by parsing considerations, although a little common sense does help, or you'll end up with an unparsable jumble like C++. (The last sketch below shows a grammar rewrite that leaves the language unchanged.)
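
To make point 1 a bit more concrete, here is a rough sketch (Python, with an invented toy interpolation syntax, so purely illustrative) of a lexer for a language with string interpolation. Note the mode stack: inside "...${expr}..." the lexer has to switch back and forth between string and code modes, which is exactly the kind of context a plain tokenizer is not supposed to need:

    # Toy lexer for an invented syntax with "...${expr}..." interpolation.
    def lex(src):
        tokens = []
        mode = ["code"]                      # stack of lexer modes
        i = 0
        while i < len(src):
            c = src[i]
            if mode[-1] == "code":
                if c == '"':                 # entering a string literal
                    mode.append("string")
                    tokens.append(("STR_START", c))
                    i += 1
                elif c == "}":               # closing a ${...} interpolation
                    mode.pop()
                    tokens.append(("INTERP_END", c))
                    i += 1
                elif c.isspace():
                    i += 1
                else:                        # any other run of code characters
                    j = i
                    while j < len(src) and not src[j].isspace() and src[j] not in '"}':
                        j += 1
                    tokens.append(("CODE", src[i:j]))
                    i = j
            else:                            # string mode
                if c == '"':                 # end of the string literal
                    mode.pop()
                    tokens.append(("STR_END", c))
                    i += 1
                elif src.startswith("${", i):
                    mode.append("code")      # interpolation: back to code mode
                    tokens.append(("INTERP_START", "${"))
                    i += 2
                else:                        # literal text inside the string
                    j = i
                    while j < len(src) and src[j] != '"' and not src.startswith("${", j):
                        j += 1
                    tokens.append(("TEXT", src[i:j]))
                    i = j
        return tokens

    print(lex('greet "hello ${name}!"'))
    # [('CODE', 'greet'), ('STR_START', '"'), ('TEXT', 'hello '),
    #  ('INTERP_START', '${'), ('CODE', 'name'), ('INTERP_END', '}'),
    #  ('TEXT', '!'), ('STR_END', '"')]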
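
For point 2, here is a tiny backtracking parser (again a Python toy, for the grammar E -> T '+' E | T and T -> '(' E ')' | NUM). The first alternative of E parses a complete T and then throws that work away whenever no '+' follows, so the work roughly doubles with every level of nesting:

    # Toy grammar: E -> T '+' E | T ;  T -> '(' E ')' | NUM  (naive backtracking)
    calls = 0

    def parse_E(s, i):
        j = parse_T(s, i)                               # alternative 1: T '+' E
        if j is not None and j < len(s) and s[j] == "+":
            k = parse_E(s, j + 1)
            if k is not None:
                return k
        return parse_T(s, i)                            # alternative 2: T (re-parsed!)

    def parse_T(s, i):
        global calls
        calls += 1
        if i < len(s) and s[i] == "(":                  # T -> '(' E ')'
            j = parse_E(s, i + 1)
            if j is not None and j < len(s) and s[j] == ")":
                return j + 1
            return None
        if i < len(s) and s[i].isdigit():               # T -> NUM
            return i + 1
        return None

    for depth in range(1, 16):
        calls = 0
        parse_E("(" * depth + "1" + ")" * depth, 0)
        print(depth, calls)   # the number of parse_T calls roughly doubles per level

A memoizing (packrat) parser or a bit of lookahead avoids this, which is why unbounded backtracking gets such a bad press.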
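
And for point 3, a small example of changing the grammar without changing the language: the left-recursive rule E -> E '+' T | T is not LL(k) for any k, but the equivalent E -> T ('+' T)* accepts exactly the same strings and parses with a trivial loop (hypothetical toy code once more):

    # Toy rewrite of E -> E '+' T | T as E -> T ('+' T)*; same language.
    def parse_E(tokens, i):
        value, i = parse_T(tokens, i)
        while i < len(tokens) and tokens[i] == "+":   # the ('+' T)* part
            rhs, i = parse_T(tokens, i + 1)
            value = ("+", value, rhs)                 # keeps '+' left-associative
        return value, i

    def parse_T(tokens, i):
        return ("num", int(tokens[i])), i + 1

    print(parse_E(["1", "+", "2", "+", "3"], 0)[0])
    # ('+', ('+', ('num', 1), ('num', 2)), ('num', 3))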