8
votes

In a LALR(1) parser, the rules in the grammar are converted into a parse table that effectively says "If you have this input so far, and the lookahead token is X, then shift to state Y, or reduce by rule R".

I have successfully constructed a LALR(1) parser in an interpreted language (ruby), not using a generator, but computing a parse table at runtime and evaluating the input using that parse table. This works surprisingly well and the table generation is quite trivial (which surprised me somewhat), supporting self-referential rules and left/right association.

One thing I am having some difficulty to understand, however, is how yacc/bison conceptually processes empty rule definitions. My parser can't handle them, since in generating the table it looks at each symbol in each rule, recursively, and "empty" is just not something that will come from the lexer, nor be reduced by a rule. So how then, do LALR(1) parsers process the empty rule? Do they treat it specially, or is it a "natural" concept that a valid algorithm should just work with, without even needing to have particular awareness of such a concept?

Let's say, a rule that can match any number of paired parentheses with nothing in the middle:

expr:   /* empty */
      | '(' expr ')'
      ;

Input like the following would match this rule:

((((()))))

This means that upon reading '(' and seeing ')' in the lookahead token, the parser choices:

  1. Shift the ')' (not possible)
  2. Reduce the input according to some other rule (not possible)
  3. Something else...

don't quite fit into the core algorithm of "shift" or "reduce". The parser effectively needs to shift nothing onto the stack, reduce "nothing" to expr, then shift the next token ')', giving '(' expr ')', which of course reduces to expr, and so on.

It's the "shift nothing" that's confusing me. How does the parse table convey such a concept? Consider also that it should be possible to invoke some semantic action that returns a value to $$ on reducing the empty value, so a rather simplistic view of just skipping that from the parse table and saying that '(' on the stack and ')' in the lookahead should simply translate to a shift, would not genuinely produce the sequence '(' expr ')', but would simply produce the sequence '(' ')'.

2
I'm sure there's long sections in the dragon book about dealing with such rules. I don't think Stack Overflow is the right venue to discuss it though - maybe programmers?Damien_The_Unbeliever
Thanks for the suggestion on the book... looking at that link now. Stackoverflow seems the right place to me. It's a direct question about the algorithm, not a subjective discussion. Somebody could very well search for this and, if anybody knows the answer, get a quick solution.d11wtq
I think I actually just figured this out when a fundamental point dawned on me, and it's so blindly obvious and straightforward. Will answer question, since Googling turns up nothing and it's a fairly natural question ;)d11wtq

2 Answers

9
votes

Despite thinking about this for days, since thinking this through when writing the question and in the minutes that followed, something just hit me as so incredibly obvious and simple.

Reduction by all rules is always: pop off X inputs from the stack, where X is the number of components in the rule, then shift the result back onto the stack and goto whatever state is given in the table after that reduction.

In the case of the empty rule, you don't need to consider that "empty" is even a concept. The parse table simply needs to include a transition that says "given '(' on the stack and 'anything that is not '(' in the lookahead, reduce by the 'empty' rule". Now since the empty rule has a size of zero components, popping zero from the stack means the stack doesn't change, then when the result of reducing nothing is shifted onto the stack, you're looking at something that does indeed appear in the grammar and everything becomes clear.

Stack       Lookahead    Remaining Input      Action
--------------------------------------------------------------
$           (            ())$                 Shift '('
$(          (            ))$                  Shift '('
$((         )            )$                   Reduce by /* empty */
$((expr     )            )$                   Shift ')'
$((expr)    )            $                    Reduce by '(' expr ')'
$(expr      )            $                    Shift ')'
$(expr)     $                                 Reduce by '(' expr ')'
$expr                                         Accept

The reason it "just works" is because in order to reduce by the empty rule you only have to pop zero items from the stack.

4
votes

Another view to maybe round out d11wtq's great answer, if that is possible:

An nullable rule (one that derives ϵ) is accounted during parser construction under the functions FOLLOW(X) and FIRST(X). For instance if you have A -> B x, and B can derive ϵ, then we have to include x in the set computed by FIRST(A). And also in the set FOLLOW(B).

Furthermore, empty rules are easily represented in the canonical LR(1) item sets.

One helpful thing is to imagine that there is an extra nonterminal symbol $ which represents the end of file.

Let's take the grammar:

S -> X | ϵ
X -> id

For the first canonical LR(1) item set, we can take the first LR(0) item set and add lookahead with the symbol '$':

S -> . X   , '$'
S -> .     , '$'
X -> . id  , '$'

Then we have one for the lookahead being id:

S -> . X   , 'id'
S -> .     , 'id
X -> . id  , 'id'

Now let's look at the FIRST and FOLLOW sets:

S -> . X   , '$'

This is not a "dot final" item, so here want to shift, but only if the set FIRST(X) contains our lookahead symbol $. This is false, so we do not fill the table entry.

Next:

S -> .     , '$'

This is a "dot final" item, so it wants to reduce. To validate the context for a reduce, we look at FOLLOW(S): can the grammar symbol we wish to reduce be followed by what is in the lookahead? Indeed, yes. $ is always in FOLLOW(S) because the start symbol is by definition followed by the end of the input. So yes, we can reduce. And since we are reducing the symbol S, the reduce is actually an accept action: the parse ends. We fill the table entry with an accept action.

Similarly we can repeat for the next item set with lookahead id. Let's skip to the S-deriving-empty rule:

S -> .     , 'id'

Can S be followed by id? Hardly. So this reduction is not appropriate. We don't fill the parser table entry.

So you can see that an empty rule poses no problem. It immediately turns into a dot final LR(0) or LR(1) item (depending on parser construction method), and is treated the same as any other dot final item with regard to considering the lookahead and filling the tables.