I have a large-ish grammar in PLY (Python Lexx Yacc) for a language that has some particular challenges in parsing. The language allows the leading syntax of two kinds of calls to look the same almost until the end of the call non-terminal. This provides lot's of opportunity for reduce/reduce conflicts, since the semantics of tokens along the way are different, but can be built out of the same terminal tokens. I've extracted simple before/after versions of the grammar below, which I'll explain a bit.
Originally, expressions were a typical "Stratified Grammar," taking calls and literals and such to primary, then primary through unary, then binary to general expression. The problem was that Call_expr
with two arguments conflicts with the version of Iter_expr
that begins with two Id's before the '/'. The conflict was on the comma after the first argument in the call, since originally, Expr -> ... -> Primary_expr -> Name_expr -> Id
was allowed. The parser could either reduce Id
to Expr
to match the Call_expr
, or leave it to match the Iter_expr
. Looking ahead to the comma didn't help it decide. If the first argument to a call is just an identifier (like a variable), this is legitimate ambiguity. Consider input id > id ( id , id ...
.
My approach was to make a kind of expression that could not be just an Id
. I added the production chain through all the expressions to give them '_nn
' versions--'not a name.' I could then define productions for Call_expr
that use any syntax in the first argument that make it more than just a name (e.g. operators, calls, etc.) to reduce it to BinOp_expr_nn
, and also allow a call production that just has an Id
as the first argument. This should convince the parser to just shift until it could resolve either the Iter_expr
or the Call_expr
(or at least know which path it's on.)
As you might have guessed, this screws everything up :). Tinkering with the expression chain also tinkers with Primary_expr
, which I still need to allow to reduce to Id
. But now, that's a reduce/reduce conflict--every Primary_expr
can stay there or go on to Unary_expr
. I can order those to make a choice (which might work), but I expect I'll just end up chasing another and another.
So, my question is: Is there a technique someone can articulate about how to allow the same tokens represent different semantics (i.e. expr vs id) that can still be parsed with LALR(1) like PLY? Short of that, any useful hacks that help manage the problem? Can this be disambiguated?
terminals: '+' '^' ',' '>' '(' ')' '/' ':' 'id' 'literal'
(i.e. punctuation (besides '->' and '|', initial-lower-case words)
non-terminals: initial-Upper-case words
Original Grammar:
S'-> S
S -> Call_expr
| Iter_expr
Expr -> BinOp_expr
BinOp_expr -> Unary_expr
BinOp_expr -> BinOp_expr '+' BinOp_expr
Unary_expr -> Primary_expr
| '^' BinOp_expr
Primary_expr -> Name_expr
| Call_expr
| Iter_expr
| Literal_expr
Name_expr -> Id
Args -> Expr
| Args ',' Expr
Call_expr -> Primary_expr '>' Id '(' ')'
| Primary_expr '>' Id '(' Args ')'
Iter_expr -> Primary_expr '>' Id '(' Id '/' Expr ')'
| Primary_expr '>' Id '(' Id ':' Id '/' Expr ')'
| Primary_expr '>' Id '(' Id ',' Id ':' Id '/' Expr ')'
Literal_expr -> literal
Id -> id
My attempt to disambiguate the first argument in Call_expr
:
S'-> S
S -> Call_expr
| Iter_expr
Expr -> BinOp_expr_nn
| BinOp_expr
BinOp_expr -> BinOp_expr_nn
| Unary_expr
BinOp_expr_nn -> Unary_expr_nn
| BinOp_expr '+' BinOp_expr
Unary_expr -> Primary_expr
| Unary_expr_nn
Unary_expr_nn -> Primary_expr_nn
| '^' BinOp_expr
Primary_expr -> Primary_expr_nn
| Name_expr
Primary_expr_nn -> Call_expr
| Iter_expr
| Literal_expr
Name_expr -> Id
Args -> Expr
| Args ',' Expr
Call_expr -> Primary_expr '>' Id '(' ')'
| Primary_expr '>' Id '(' Expr ')'
| Primary_expr '>' Id '(' Id , Args ')'
| Primary_expr '>' Id '(' BinOp_expr_nn , Args ')'
Iter_expr -> Primary_expr '>' Id '(' Id '/' Expr ')'
| Primary_expr '>' Id '(' Id ':' Id '/' Expr ')'
| Primary_expr '>' Id '(' Id ',' Id ':' Id '/' Expr ')'
Literal_expr -> literal
Id -> id
BinOp_expr -> Unary_expr | BinOp_expr '+' Unary_expr
(so that+
is less-associative; as written, it is ambiguous) andUnary_expr -> Primary_expr | '^' Unary_expr
so that^ a + b
means(^a) + b
rather than^(a + b)
). (Or possiblyUnary_expr -> Primary_expr | '^' Primary_expr
, if^^a
is not to be allowed.) – riciPrimary_expr -> '(' Expr ')'
in order to allow parenthesization. – rici