It is clear that the following is ambiguous:
expr : NUMBER %prec num_p
| NUMBER ',' expr %prec exp_p
| expr ',' NUMBER ',' NUMBER
since any list of three or more numbers can be parsed in various ways. Roughly speaking, we can take single numbers off the beginning of the list or pairs of numbers off the end of the list, until we meet somewhere in the middle; however, there is no definition of where the middle point might be.
Consider, for example, the various parse trees which could produce 1, 2, 3, 4, 5
. Here are just two (with numbers indicating which production was used to expand expr
):
expr(2) expr(3)
/ \ / | \
/ \ / | |
| expr(2) / | |
| / \ / | |
| / \ expr(3) | |
| / expr(3) / | \ | |
| | / | \ / | \ | |
| |expr(1)| \ expr(1)| | | |
| | | | | | | | | |
1 , 2 , 3 , 4 , 5 1 , 2 , 3 , 4 , 5
Both of the above trees are, in some sense, maximal. The one on the left takes as many single NUMBERs as possible using production 2, until only two NUMBERs are left for production 3. The one on the right applies production 3 as many times as possible. (It would have needed a single application of production 2 if the list of numbers had an even length.)
In order to resolve the ambiguity, we need a clear statement of the intent. But it seems to me unlikely that it can be resolved with a precedence declaration. Remember that precedence relations are always between a possible reduction (on the top of the parser stack) and a lookahead symbol. They never compare two lookahead symbols nor two productions. If the lookahead symbol wins, it is shifted onto the stack; if the reduction wins, the stack is reduced. It is no more complicated than that.
So if precedence could help, the relevant token must be ','
, not NUMBER
. NUMBER
must always be shifted onto the parse stack. Since no production ends with ','
, it is never possible to reduce the stack when NUMBER
is the lookahead symbol. By contrast, when ','
is the lookahead symbol, it is usually possible either to reduce the top of the parser stack or to shift the ','
in preparation for a different reduction.
The only place where this decision is even possible is in the case where we have seen NUMBER
and are looking at a ','
, so we have to decide whether to apply production 1, in preparation for production 3, or shift the ','
, leaving production 2 as the only option. Neither of these decisions can prosper: If we choose to reduce, it is possible that production 3 will turn out to be impossible because there are not enough numbers in the list; if we choose to shift, then production 3 will never be used.
In a left-to-right parse, the algorithm which produces the right-hand parse above is not possible, because we cannot know whether the list is of even or odd length until we reach the end, at which point it is far too late to retroactively reduce the first two numbers. On the other hand, the left-hand parse will require a lookahead of four tokens, not one, in order to decide at which point to stop using production 2. That makes it possible to construct an LR(1) grammar, because there is a way to rewrite any LR(k) grammar as an LR(1) grammar, but the resulting grammar is usually complicated.
I suspect that neither of these was really your intention. In order to recommend a resolution, it will be necessary to know what the precise intention is.
One possibility (motivated by a comment) is that expr
also includes something which is neither a number nor a list of numbers:
expr: NUMBER
| complex_expression
In that case, it might be that the grammar intends to capture two possibilities:
What is left ambiguous in the above formulation is a list consisting only of NUMBER
s, since either the first or last number could be parsed as an expr
. Here there are only a couple of reasonable possible resolutions:
a list of NUMBER
s is always be parsed as the first option (expr
at the end)
a list of NUMBER
s is parsed as the second option (expr
at the beginning) if and only if there are an odd number of elements in the list.
The first resolution is much easier, so we can start with it. In this case, the first element in the list essentially determines how the list will be parsed, so it not possible to reduce the first NUMBER
to expr
. We can express that by separating the different types of expr
:
expr: list_starting_expr | list_ending_expr
list_starting_expr: complex_expression ',' NUMBER ',' NUMBER
| list_start_expr ',' NUMBER ',' NUMBER
list_ending_expr : complex_expression
| NUMBER ',' list_ending_expr
| NUMBER
The last production in the above example allows for a list entirely of NUMBER
s to be parsed as a list_ending_expr
.
It also allows a list containing only a single complex_expression
to be parsed as a list_ending_expr
, while a list_starting_expr
is required to have at least three elements. Without that, a list consisting only of a complex_expression
would have been ambiguous. In the example grammar in the question, the list containing only a complex_expression
is implicitly forbidden; that could be reproduced by changing the base production for list_ending_expr
from list_ending_expr: complex_expression
to list_ending_expr: NUMBER ',' complex_expression
.
But what if we wanted the second resolution? We can still recognize the language, but constructing a correct parse tree may require some surgery. We can start by separating out the case where the list consists only of NUMBER
s.
expr: list_starting_expr | list_ending_expr | ambiguous_list
list_starting_expr: complex_expression ',' NUMBER ',' NUMBER
| list_starting_expr ',' NUMBER ',' NUMBER
list_ending_expr : NUMBER ',' complex_expression
| NUMBER ',' list_ending_expr
ambiguous_list : NUMBER
| NUMBER ',' ambiguous_list
Despite the frequently-repeated claim that right-recursion should be avoided in bottom-up grammars, here it is essential that ambiguous_list
and list_ending_expr
be right-recursive, because we cannot distinguish between the two possibilities until we reach the end of the list.
Now we could use a semantic action to simply count the number of elements in the list. That action needs to be associated with the reduction of ambiguous_list
to expr
:
expr: list_starting_expr | list_ending_expr
| ambiguous_list {
if (count_elements($1) % 2 == 1) {
$$ = make_list_starting_expr($1);
}
else {
$$ = make_list_starting_expr($1);
}
}
But we can actually distinguish the two cases grammatically, precisely because of the right recursion:
expr: list_starting_expr
| list_ending_expr
| even_list_of_numbers
| odd_list_of_numbers
list_starting_expr : complex_expression ',' NUMBER ',' NUMBER
| list_starting_expr ',' NUMBER ',' NUMBER
list_ending_expr : NUMBER ',' complex_expression
| NUMBER ',' list_ending_expr
odd_list_of_numbers : NUMBER
| NUMBER ',' NUMBER ',' odd_list_of_numbers
even_list_of_numbers: NUMBER ',' NUMBER
| NUMBER ',' NUMBER ',' even_list_of_numbers
It might be more meaningful to write this as:
expr: expr_type_one | expr_type_two
expr_type_one: list_starting_expr | even_list_of_numbers
expr_type_two: list_ending_expr | odd_list_of_numbers
/* Remainder as above */
Note: The above grammars, like the grammar in the original question, do not allow expr
to consist only of a complex_expression
. If that it were desired to handle the case of only a single complex_expression
, then that syntax could be added directly to the productions for expr
(or to whichever of expr_type_one
or expr_type_two
were appropriate.
expr : NUMBER | NUMBER ',' expr | expr ',' NUMBER ',' NUMBER
– Miloš Ljumović