
I have a large-ish grammar in PLY (Python Lex-Yacc) for a language that has some particular challenges in parsing. The language allows the leading syntax of two kinds of calls to look the same almost until the end of the call non-terminal. This provides lots of opportunity for reduce/reduce conflicts, since the semantics of the tokens along the way are different, but they can be built out of the same terminal tokens. I've extracted simple before/after versions of the grammar below, which I'll explain a bit.

Originally, expressions formed a typical stratified grammar, taking calls and literals and such to primary, then primary through unary, then binary to general expression. The problem was that Call_expr with two arguments conflicts with the version of Iter_expr that begins with two Ids before the '/'. The conflict was on the comma after the first argument in the call, since originally Expr -> ... -> Primary_expr -> Name_expr -> Id was allowed. The parser could either reduce Id to Expr to match the Call_expr, or leave it to match the Iter_expr. Looking ahead to the comma didn't help it decide. If the first argument to a call is just an identifier (like a variable), this is legitimate ambiguity. Consider input id > id ( id , id ... .

My approach was to make a kind of expression that could not be just an Id. I added a production chain through all the expressions to give them '_nn' versions ('not a name'). I could then define productions for Call_expr whose first argument uses any syntax that makes it more than just a name (e.g. operators, calls, etc.), reducing it to BinOp_expr_nn, and also allow a call production that has just an Id as the first argument. This should convince the parser to keep shifting until it can resolve either the Iter_expr or the Call_expr (or at least know which path it's on).

As you might have guessed, this screws everything up :). Tinkering with the expression chain also tinkers with Primary_expr, which I still need to be reducible from a bare Id. But now that's a reduce/reduce conflict: every Primary_expr can stay where it is or go on to Unary_expr. I can order those to make a choice (which might work), but I expect I'll just end up chasing another conflict and another.

So, my question is: Is there a technique someone can articulate for allowing the same tokens to represent different semantics (i.e. expr vs. id) that can still be parsed by an LALR(1) generator like PLY? Short of that, are there any useful hacks that help manage the problem? Can this be disambiguated?

terminals:  '+' '^' ',' '>' '(' ')' '/' ':' 'id' 'literal' 
   (i.e. punctuation (besides '->' and '|') and initial-lower-case words)
non-terminals:  initial-Upper-case words

Original Grammar:

S'-> S
S -> Call_expr
   | Iter_expr
Expr -> BinOp_expr
BinOp_expr -> Unary_expr
BinOp_expr -> BinOp_expr '+' BinOp_expr
Unary_expr -> Primary_expr
   | '^' BinOp_expr
Primary_expr -> Name_expr
   | Call_expr
   | Iter_expr
   | Literal_expr
Name_expr -> Id
Args -> Expr
   | Args ',' Expr
Call_expr -> Primary_expr '>' Id '(' ')'
   | Primary_expr '>' Id '(' Args ')'
Iter_expr -> Primary_expr '>' Id '(' Id '/' Expr ')'
   | Primary_expr '>' Id '(' Id ':' Id '/' Expr ')'
   | Primary_expr '>' Id '(' Id ',' Id ':' Id '/' Expr ')'
Literal_expr -> literal
Id -> id
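
For reference, here is roughly how productions like these are written in PLY (sketch only; the token names, the single-character literals, and the tuple/list AST here are just illustrative, not my real code):

# Minimal PLY sketch of two of the productions above.  Assumes the lexer
# defines an ID token and declares '>', '(', ')', ',' as literals, and
# that the AST is built from plain tuples and lists.
def p_call_expr(p):
    """call_expr : primary_expr '>' ID '(' ')'
                 | primary_expr '>' ID '(' args ')'"""
    args = p[5] if len(p) == 7 else []
    p[0] = ('call', p[1], p[3], args)

def p_args(p):
    """args : expr
            | args ',' expr"""
    if len(p) == 2:
        p[0] = [p[1]]
    else:
        p[0] = p[1] + [p[3]]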

My attempt to disambiguate the first argument in Call_expr:

S'-> S
S -> Call_expr
   | Iter_expr
Expr -> BinOp_expr_nn
   | BinOp_expr
BinOp_expr -> BinOp_expr_nn
   | Unary_expr
BinOp_expr_nn -> Unary_expr_nn
   | BinOp_expr '+' BinOp_expr
Unary_expr -> Primary_expr
   | Unary_expr_nn
Unary_expr_nn -> Primary_expr_nn
   | '^' BinOp_expr
Primary_expr -> Primary_expr_nn
   | Name_expr
Primary_expr_nn -> Call_expr
   | Iter_expr
   | Literal_expr
Name_expr -> Id
Args -> Expr
   | Args ',' Expr
Call_expr -> Primary_expr '>' Id '(' ')'
   | Primary_expr '>' Id '(' Expr ')'
   | Primary_expr '>' Id '(' Id ',' Args ')'
   | Primary_expr '>' Id '(' BinOp_expr_nn ',' Args ')'
Iter_expr -> Primary_expr '>' Id '(' Id '/' Expr ')'
   | Primary_expr '>' Id '(' Id ':' Id '/' Expr ')'
   | Primary_expr '>' Id '(' Id ',' Id ':' Id '/' Expr ')'
Literal_expr -> literal
Id -> id
I'd just fold Iter_expr and Call_expr together into one syntax, and then sort out which it is by walking the AST. If you know any Lisp, this should be obvious. That is to say, allow those Id : Id / Expr forms in the Call_expr syntax, then walk the Call_expr nodes to determine whether they are actually valid Iter_exprs. This will over-generate (it allows invalid uses of Id : Id / Expr); you can check for that and diagnose it. – Kaz
@rici: Sorry for the errors, but you've interpreted most of it correctly. '>' should be in the list of terminals; it separates some kinds of calls from others which are not included in my toy version. All of these calls can be chained, a la "foo > bar(baz / a + baz) > qux(b, c, d, e)", and each lhs of '>' is a Primary_expr. '/' is another syntax separator, not an operator. Sorry for the confusion: most of this comes from the simplifications, and out of context I'm sure it seems really weird. – jspencer
I still think you want BinOp_expr -> Unary_expr | BinOp_expr '+' Unary_expr (so that + is left-associative; as written, it is ambiguous) and Unary_expr -> Primary_expr | '^' Unary_expr (so that ^ a + b means (^a) + b rather than ^(a + b)). (Or possibly Unary_expr -> Primary_expr | '^' Primary_expr, if ^^a is not to be allowed.) – rici
You might also want Primary_expr -> '(' Expr ')' in order to allow parenthesization. – rici
Yes, I'm considering those. At the moment I'm using precedence to resolve all that, but I will play around with the disambiguation. Thanks to you and @Kaz for your wonderful suggestions. Much appreciated, and I should be on my way now! – jspencer

1 Answer


Despite the title of your post, your grammar is not ambiguous. It is simply not LR(1), for the reason you mention: the input

A ( B , 

can be the start of either a Call_expr or an Iter_expr. In the first case, the B must be reduced to an Expr and then to Args; in the second case, it must not be reduced because it needs to still be an id when the right-hand side id '(' id ',' id ':' id '/' Expr ')' is reduced. The decision cannot be made simply by looking at a single lookahead token (the ,), so there is a shift-reduce conflict.

This conflict can be resolved with at most two additional lookahead tokens, since the shift is only a valid action if the , is followed by precisely an id and then a :. So that makes the grammar LALR(3). Unfortunately, Ply does not generate LALR(3) parsers (and neither does yacc/bison), but there are some alternatives.

1. Use a different parsing algorithm

Since the grammar is unambiguous, it can be parsed without problems (and without any modifications) with a GLR parser. Ply doesn't produce GLR parsers either, but Bison can. That's not likely to be of much use to you, but I thought I should mention it in case you are not locked into the use of Python.

2. Use a grammar which allows some invalid inputs, and discard them with semantic analysis

This is almost certainly the simplest solution, and it's what I would normally recommend. If you change the definition of Iter_expr to:

Iter_expr : id '(' id '/' Expr ')'
          | id '(' Args ':' id '/' Expr ')'

then it will still recognize every valid input (since both id and id , id are valid instances of Args). This removes the shift-reduce conflict; in effect, it lets the parser avoid making a decision until it either encounters a ) (which would indicate a Call_expr) or a : (indicating an Iter_expr). (There's no problem with the first alternative for Iter_expr, since the decision to shift the / instead of reducing the id does not require additional lookahead.)

Of course, the second production for Iter_expr will recognize a lot of things which are not valid iteration expressions: lists of more than two items, and lists which include expressions more complicated than a single id. However, those inputs are not valid programs at all, so they can simply be rejected in the action for Iter_expr. The precise code to recognize a valid iteration will depend on how you represent your AST, but it's not complicated: just check that the length of the Args list is either one or two, and that each item in the list is just an id.
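
In PLY terms, that check might look something like this (a sketch only: the rule and token names and the AST shapes, a Python list for Args and ('name', ...) tuples for identifiers, are assumptions about your implementation):

# Sketch of the semantic check in the Iter_expr action.  Assumes args
# reduces to a Python list and that a bare identifier appears in it as
# a ('name', ...) tuple.
def p_iter_expr(p):
    """iter_expr : ID '(' ID '/' expr ')'
                 | ID '(' args ':' ID '/' expr ')'"""
    if len(p) == 7:
        p[0] = ('iter', p[1], [('name', p[3])], None, p[5])
    else:
        args = p[3]
        # The relaxed rule over-generates; reject anything that is not
        # one or two plain identifiers before the ':'.
        if len(args) > 2 or not all(a[0] == 'name' for a in args):
            print("invalid iteration expression")
            raise SyntaxError   # PLY treats this like a syntax error
        p[0] = ('iter', p[1], args, p[5], p[7])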

3. Use a lexical hack

One way to compensate for the lack of lookahead information is to gather it in the lexer instead, by collecting the necessary lookahead tokens into a buffer and only emitting lexemes once their syntactic category is known. Here, the lexer can look for the sequence '(' id ',' id ':' and mark the first id so that it can only be used in an Iter_expr. The only change to the grammar is then:

Iter_expr : id '(' id '/' Expr ')'
          | id '(' id ':' id '/' Expr ')'
          | id '(' iter_id ',' id ':' id '/' Expr ')'

While this will work fine in this particular case, it is not very maintainable. In particular, it relies on being able to define a simple and unambiguous pattern which can be implemented in the lexer. Since that pattern is a simplification of the grammar, it is quite possible that some future syntactic addition will create a syntax which also matches the same pattern. (This is called the lexical "hack" for a reason.)
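
In PLY, one possible shape for this is a small filter object wrapped around the real lexer. Again this is only a sketch: the ID token name, the single-character literals, and the extra ITER_ID token type are assumptions about your lexer.

# Sketch of a token filter around the PLY lexer.  It watches for the
# sequence '(' ID ',' ID ':' and retags the first ID as ITER_ID so the
# parser can tell the two call forms apart.
class IterLexer(object):
    def __init__(self, lexer):
        self.lexer = lexer
        self.pending = []

    def input(self, data):
        self.lexer.input(data)

    def token(self):
        if self.pending:
            return self.pending.pop(0)
        tok = self.lexer.token()
        if tok is None or tok.type != '(':
            return tok
        # Buffer up to four more tokens of lookahead after the '('.
        # (A '(' inside the buffered window is not re-examined; that is
        # good enough for a sketch.)
        window = []
        while len(window) < 4:
            nxt = self.lexer.token()
            if nxt is None:
                break
            window.append(nxt)
        if [t.type for t in window] == ['ID', ',', 'ID', ':']:
            window[0].type = 'ITER_ID'
        self.pending = window
        return tok

The parser is then built with ITER_ID added to the token list and handed the wrapped lexer, e.g. parser.parse(text, lexer=IterLexer(lexer)).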

4. Find an LALR(1) grammar

As indicated, this grammar is LALR(3). But there is no such thing as an LALR(3) language. Or more precisely, if a language has an LALR(k) grammar, then it also has an LALR(1) grammar, and that grammar can be produced mechanically from the LALR(k) grammar. Moreover, given a parse for the one-symbol lookahead grammar, it is possible to reconstruct the parse tree for the original grammar.

Because the mechanically-produced grammars are quite large, this procedure is not very attractive, and I don't know of an implementation of the algorithm. Instead, it is most common to attempt to rewrite just a part of the grammar, as you attempted in the original question. This can be done, certainly, but the end result is not totally intuitive.

However, it's not that difficult. Here, for example, is a slightly simplified version of your grammar, with the redundant unit productions removed and a couple of fixes in operator precedence (possibly incorrect, since I don't know what semantics you're aiming for).

The productions whose names end with N do not produce ID. For each of them, there is a corresponding production without the N which adds ID. Once that is done, it was necessary to rewrite Args to use the ExprN production, as well as allowing the various argument lists which start with one or two IDs. The Chain production was just to avoid some repetition.

Start   : Call
        | Iter
Expr    : ID
        | ExprN
ExprN   : UnaryN
        | Expr '+' Unary
Unary   : ID
        | UnaryN
UnaryN  : ChainN
        | '^' Chain
Chain   : ID
        | ChainN
ChainN  : PrimaryN
        | Chain '>' CallIter
CallIter: Call
        | Iter
PrimaryN: LITERAL
        | Call
        | Iter
        | '(' Expr ')'
Call    : ID '(' ')'
        | ID '(' ID ')'
        | ID '(' ID ',' ID ')'
        | ID '(' Args ')'
Iter    : ID '(' ID '/' Expr ')'
        | ID '(' ID ':' ID '/' Expr ')'
        | ID '(' ID ',' ID ':' ID '/' Expr ')'
Args    : ExprN ExprList
        | ID ',' ExprN ExprList
        | ID ',' ID ',' Expr ExprList
ExprList:
        | ExprList ',' Expr

I haven't tested this as much as I would have wanted to, but I think it produces the correct language.
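
In case it helps, here is a sketch of what the actions for the Args/ExprList rules might look like in PLY, flattening the arguments back into the single list which the original Args production would have produced. The lowercase rule names and the ('name', ...) AST tuples are assumptions.

# Sketch of PLY actions for the rewritten Args/ExprList rules.  They
# rebuild a flat argument list, so later passes see the same structure
# the original Args production would have produced.
def p_args(p):
    """args : exprn exprlist
            | ID ',' exprn exprlist
            | ID ',' ID ',' expr exprlist"""
    if len(p) == 3:
        p[0] = [p[1]] + p[2]
    elif len(p) == 5:
        p[0] = [('name', p[1]), p[3]] + p[4]
    else:
        p[0] = [('name', p[1]), ('name', p[3]), p[5]] + p[6]

def p_exprlist_empty(p):
    "exprlist :"
    p[0] = []

def p_exprlist(p):
    "exprlist : exprlist ',' expr"
    p[0] = p[1] + [p[3]]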