3
votes

From this question, a grammar for expressions involving binary operators (+ - * /) which disallows outer parentheses:

top_level   : expression PLUS term
            | expression MINUS term
            | term TIMES factor
            | term DIVIDE factor
            | NUMBER
expression  : expression PLUS term
            | expression MINUS term
            | term
term        : term TIMES factor
            | term DIVIDE factor
            | factor
factor      : NUMBER
            | LPAREN expression RPAREN

This grammar is LALR(1). I have therefore been able to use PLY (a Python implementation of yacc) to create a bottom-up parser for the grammar.

For comparison purposes, I would now like to try building a top-down recursive-descent parser for the same language. I have transformed the grammar, removing left-recursion and applying left-factoring:

top_level   : expression top_level1
            | term top_level2
            | NUMBER
top_level1  : PLUS term
            | MINUS term
top_level2  : TIMES factor
            | DIVIDE factor
expression  : term expression1
expression1 : PLUS term expression1
            | MINUS term expression1
            | empty
term        : factor term1
term1       : TIMES factor term1
            | DIVIDE factor term1
            | empty
factor      : NUMBER
            | LPAREN expression RPAREN

Without the top_level rules this grammar is LL(1), so writing a recursive-descent parser would be fairly straightforward. Unfortunately, including top_level, the grammar is not LL(1).

  1. Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
  2. Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
  3. Is it possible to simplify this grammar to ease the recursive-descent approach?
2

2 Answers

5
votes

The grammar is not LL with finite lookahead, but the language is LL(1) because an LL(1) grammar exists. Pragmatically, a recursive descent parser is easy to write even without modifying the grammar.

  1. Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?

If α is a derivation of expression, β of term and γ of factor, then top_level can derive both the sentence (α)+β and the sentence (α)*γ (but it cannot derive the sentence (α).) However, (α) is a possible derivation of both expression and term, so it is impossible to decide which production of top_level to use until the symbol following the ) is encountered. Since α can be of arbitrary length, there is no k for which a lookahead of k is sufficient to distinguish the two productions. Some people might call that LL(∞), but that doesn't seem to be a very useful grammatical category to me. (LL(*) is, afaik, the name of a parsing strategy invented by Terence Parr, and not an accepted name for a class of grammars.) I would simply say that the grammar is not LL(k) for any k.

  1. Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)

Sure. It's not even that difficult.

The first symbol must either be a NUMBER or a (. If it is a NUMBER, we predict (call) expression. If it is (, we consume it, call expression, consume the following ) (or declare an error, if the next symbol is not a close parenthesis), and then either call expression1 or term1 and then expression1, depending on what the next symbol is. Again, if the next symbol doesn't match the FIRST set of either expression1 or term1, we declare a syntax error. Note that the above strategy does not require the top_level* productions at all.

Since that will clearly work without backtracking, it can serve as the basis for how to write an LL(1) grammar.

  1. Is it possible to simplify this grammar to ease the recursive-descent approach?

I'm not sure if the following grammar is any simpler, but it does correspond to the recursive descent parser described above.

top_level   : NUMBER optional_expression_or_term_1
            | LPAREN expression RPAREN expression_or_term_1
optional_expression_or_term_1: empty
            | expression_or_term_1
expression_or_term_1
            : PLUS term expression1
            | MINUS term expression1
            | TIMES factor term1 expression1
            | DIVIDE factor term1 expression1
expression  : term expression1
expression1 : PLUS term expression1
            | MINUS term expression1
            | empty
term        : factor term1
term1       : TIMES factor term1
            | DIVIDE factor term1
            | empty
factor      : NUMBER
            | LPAREN expression RPAREN

I'm left with two observations, both of which you are completely free to ignore (particularly the second one which is 100% opinion).

The first is that it seems odd to me to ban (1+2) but allow (((1)))+2, or ((1+2))+3. But no doubt you have your reasons. (Of course, you could easily ban the redundant double parentheses by replacing expression with top_level in the second production for factor.

Second, it seems to me that the hoop-jumping involved in the LL(1) grammar in the third section is just one more reason to ask why there is any reason to use LL grammars. The LR(1) grammar is easier to read, and its correspondence with the language's syntactic structure is clearer. The logic of the generated recursive-descent parser may be easier to understand, but to me that seems secondary.

2
votes

To make the grammar LL(1) you need to finish left-factoring top_level. You stopped at:

top_level   : expression top_level1
            | term top_level2
            | NUMBER

expression and term both have NUMBER in their FIRST sets, so they must first be substituted to left-factor:

top_level   : NUMBER term1 expression1 top_level1
            | NUMBER term1 top_level2
            | NUMBER
            | LPAREN expression RPAREN term1 expression1 top_level1
            | LPAREN expression RPAREN term1 top_level2

which you can then left-factor to

top_level   : NUMBER term1 top_level3
            | LPAREN expression RPAREN term1 top_level4

top_level3  : expression1 top_level1
            | top_level2
            | empty

top_level4  : expression1 top_level1
            | top_level2

Note that this still is not LL(1) as there are epsilon rules (term1, expression1) with overlapping FIRST and FOLLOW sets. So you need to factor those out too to make it LL(1)