In a LALR(1) parser, the rules in the grammar are converted into a parse table that effectively says "If you have this input so far, and the lookahead token is X, then shift to state Y, or reduce by rule R".
I have successfully constructed a LALR(1) parser in an interpreted language (ruby), not using a generator, but computing a parse table at runtime and evaluating the input using that parse table. This works surprisingly well and the table generation is quite trivial (which surprised me somewhat), supporting self-referential rules and left/right association.
One thing I am having some difficulty to understand, however, is how yacc/bison conceptually processes empty rule definitions. My parser can't handle them, since in generating the table it looks at each symbol in each rule, recursively, and "empty" is just not something that will come from the lexer, nor be reduced by a rule. So how then, do LALR(1) parsers process the empty rule? Do they treat it specially, or is it a "natural" concept that a valid algorithm should just work with, without even needing to have particular awareness of such a concept?
Let's say, a rule that can match any number of paired parentheses with nothing in the middle:
expr: /* empty */
| '(' expr ')'
;
Input like the following would match this rule:
((((()))))
This means that upon reading '(' and seeing ')' in the lookahead token, the parser choices:
- Shift the ')' (not possible)
- Reduce the input according to some other rule (not possible)
- Something else...
don't quite fit into the core algorithm of "shift" or "reduce". The parser effectively needs to shift nothing onto the stack, reduce "nothing" to expr
, then shift the next token ')'
, giving '(' expr ')'
, which of course reduces to expr
, and so on.
It's the "shift nothing" that's confusing me. How does the parse table convey such a concept? Consider also that it should be possible to invoke some semantic action that returns a value to $$
on reducing the empty value, so a rather simplistic view of just skipping that from the parse table and saying that '('
on the stack and ')'
in the lookahead should simply translate to a shift, would not genuinely produce the sequence '(' expr ')'
, but would simply produce the sequence '(' ')'
.