3
votes

I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have to distilled the problem grammar down to the following.

start:
  stmt_list

expr:
  INT | IDENT | lambda | '(' expr ')' { $$ = $2 }

lambda:
  '(' params ')' '{' stmt_list '}'

params:
  expr | params ',' expr

stmt:
  /* empty */ | expr

stmt_list:
  stmt | stmt_list ';' stmt

A lambda function looks something like this:

map((v) { v * 2 }, collection)

My parser emits:

conflicts: 1 shift/reduce

Given the input:

(a)

It correctly parses an expr by the '(' expr ')' rule. However given an input of:

(a) { a }

(Which would be a lambda for the identity function, returning its input). I get:

syntax error: unexpected '{'

This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.

EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.

EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.

2
Are statements not separated by punctuation? IOW, if (v) {v*2} is not a lambda, what is it? A stmt_list or something else? Please add a bit more grammar to the question.rici
There are no semicolons between statements (they are actually comments, as in lisp. New lines are effectively end-of-statement indicators, as in Go or JavaScript.d11wtq
This does pose some complexity in the grammar, as newlines may occur, for example in argument lists. Maybe this is what's getting me into trouble. I just know its complaining about the '{' right after its reduced the argument list as an expr.d11wtq
I could post the entire grammar, but obviously it's long and hard to distill down to the root cause of the conflict.d11wtq
Actually TBH I haven't tried writing a tiny grammar with this issue in isolation. That would help. I'll do that and probably have a lightbulb moment :)d11wtq

2 Answers

3
votes

Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide with the input reduced to ( <expr> and with lookahead ) whether or not the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar LR(2), which is outside of the competence of go/yacc.

If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.

There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNUTα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.

Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.

First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).

The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:

start:
  stmt_list

expr:
    INT
  | IDENT
  | lambda
  | '(' expr_list ')'
        { // If $2 has more than one expr, report error
          $$ = $2
        }

lambda:
  '(' expr_list ')' '{' stmt_list '}'
        { // If anything in expr_list is not a valid param, report error
          $$ = make_lambda($2, $4)
        }

expr_list:
  expr | expr_list ',' expr

stmt:
  /* empty */ | expr

stmt_list:
  stmt | stmt_list ';' stmt

Notes

  1. That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2 then T and U are strings the the FIRSTk-1 and FOLLOWk-1 sets.
0
votes

If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.

See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html

You can get this effect two ways: a) Count on the accidental behavior of the parsing engine. If an LALR parser handles shifts first, and then reductions if there are no shifts, then you'll get this "prefer shift" for free. All the parser generator has to do is built the parse tables anyway, even if there is a detected conflict. b) Enforce the accidental behavior. Design (or a get a) parser generator to accept "prefer shift on token T". Then one can supress the ambiguity. One still have to implement the parsing engine as in a) but that's pretty easy.

I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).

Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific the state in which the conflict occured; you can even make it specific to the pair of conflicting rules but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and it have it provide a decision.

The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.