Your grammar is accurate and unambiguous, as far as I can see. But it is not finite lookahead, much less LR(1).
An LR(k) parser must be able to predict a reduction at the point in the input where the non-terminal is complete, based only on an examination of the next k tokens.
Now consider the input consisting of a large number (relative to k) of A
tokens, followed by a B
. If the input will eventually match Y
, the parser must now reduce Y → A
and then possibly one or more instances of Y → A Y
, and the left-to-right nature of LR parsing means that these reductions must take place immediately.
But there is no way of knowing how many instances to reduce without seeing the entire rest of the input and counting the number of B
s. Moreover, it is entirely possible that the input matches X
, in which case the B
must be shifted since the reduction to Y
is incorrect. Again, there is not enough information to make this decision until the end of the input is reached.
These dilemmas are revealed in the state table produced by passing the -v
option to yacc/bison. For example, we can see:
State 1
4 X: A . X B
7 Y: A . Y B
8 | A . Y
9 | A .
A shift, and go to state 1
B shift, and go to state 2
B [reduce using rule 9 (Y)]
$default reduce using rule 9 (Y)
X go to state 7
Y go to state 8
which expresses the question of whether to reduce the last A
to a Y
on the assumption that there are more A
s than B
s, or to shift the first B
in preparation for reducing it to an X
, on the assumption that there are more B
s than A
s.
(The conflict is indicated by the presence of two different actions for lookahead B
. The reduce action is contained within brackets ([reduce using rule 9 (Y)]
) to indicate that bison has resolved the conflict by choosing to always shift. Of course, that is not always the correct resolution, so your parser will not correctly identify inputs which require the reduction action.)
This language does have an LR(1) grammar. The trick is to recognize the inner balanced sequence of A
s and B
s first, without forcing the parser to decide which type of sentence it has encountered. That decision can be made when a B
is encountered which does not match any A
, or when the end of the sentence is found with some A
s yet unmatched.
Here, AB
is a balanced sequence of A
s and B
s. AAB
one or more A
s followed by an AB
; while ABB
is an AB
followed by one or more B
s.
stmt: tail NL
| C tail NL
tail: AAB
| ABB
AAB : A AB
| A AAB
ABB : AB B
| ABB B
AB : /* empty */
| A AB B
It is tempting to write
AAB : AA AB
AA : A
| A AA
but it won't work, because it requires the parser to decide whether an A
is part of an AA
before it ever sees a B
. The way it is written above, the A
is always shifted onto the stack, and a decision is made as the parser stack is unrolled.
On the other hand, it would have been possible to have written the production for ABB
in this way (ABB: AB BB; BB: B | BB B
), because by the time the parser reaches the first unmatched B
all relevant decisions have already been made. I chose to do it as above for symmetry.