Bison: distinguish variables by type / YYERROR in GLR-parsers

Question

I'm working on a parser, using GNU bison, and I'm facing an interesting problem. My actual problem is a bit different and less interesting in general, so I will state it a bit differently, so that an answer would be more useful in general.

I need to distinguish expressions depending on their type, like arithmetic expressions from string expressions. Their top-level nonterminals have some common ancestors, like

statement
    : TOK_PRINT expression TOK_SEMICOLON {...}
;
expression
    : arithmeticexpression {...}
    | stringexpression {...}
;

Now I need to be able to have variables in both kinds of expressions

arithmeticexpression
    : TOK_IDENTIFIER {...}
;
stringexpression
    : TOK_IDENTIFIER {...}
;

(Only variables of string type should be allowed in the stringexpression case, and only variables of int or float type in the arithmeticexpression case.) But this obviously causes a reduce/reduce (R/R) conflict that is inherent in the language: it cannot be resolved by rewriting the grammar, because the language itself is ambiguous.

Of course I could water down the language, so that only general "Expression" objects are passed around on the parse-stack, but then I'd have to do a lot of manual type-checking in the actions, which I would like to avoid. Also, in my real use-case, the paths through the grammar to the variable-rules are so different, that I would have to water the language down so much that I would lose a lot of grammar-rules (i.e. lose a lot of structural information) and would need to hand-write parsing-engines into some actions.

I have read about GLR-parsing, which sounds like it could solve my problem. I'm considering using this feature: keep the grammar as above and invoke YYERROR in the paths where the corresponding variable has the wrong type.

arithmeticexpression
    : TOK_IDENTIFIER {
          if(!dynamic_cast<IntVariable*>(
              symbol_table[*$<stringvalue>1]))
              YYERROR;
      }
;
stringexpression
    : TOK_IDENTIFIER {
          if(!dynamic_cast<StringVariable*>(
              symbol_table[*$<stringvalue>1]))
              YYERROR;
      }
;

but the Bison Manual says

During deterministic GLR operation, the effect of YYERROR is the same as its effect in a deterministic parser.

The effect in a deferred action is similar, but the precise point of the error is undefined; instead, the parser reverts to deterministic operation, selecting an unspecified stack on which to continue with a syntax error.

In a semantic predicate (see Semantic Predicates) during nondeterministic parsing, YYERROR silently prunes the parse that invoked the test.

I'm not sure I understand this correctly - I understand it this way:

1. doesn't apply here, because GLR-parsing isn't deterministic
2. is how I have it in the code above, but shouldn't be done this way, because it is completely random which path is killed by the YYERROR (correct me if I'm wrong)
3. wouldn't solve my problem, because semantic predicates (not semantic actions!) must be at the beginning of a rule (correct me if I'm wrong), at which point the yylval from the TOK_IDENTIFIER token isn't accessible (correct me if I'm wrong), so I can't look into the symbol table to find the type of the variable.

Does anyone have experience with such a situation? Is my understanding of the manual wrong? How would you approach this problem? To me, the problem seems to be so natural that I would assume people run into it often enough that bison would have a solution built-in...

rici · Accepted Answer · 2018-06-13T18:12:33

Note: Since at least bison 3.0.1 and as of bison 3.0.5, there is a bug in the handling of semantic predicate actions which causes bison to output a #line directive in the middle of a line, causing the compile to fail. A simple fix is described in this answer, and has been committed to the maint branch of the bison repository. (Building bison from the repository is not for the faint-hearted. But it would suffice to download data/c.m4 from that repository if you didn't want to just edit the file yourself.)

Let's start with the concrete question about YYERROR in a GLR parser.

In effect, you can use YYERROR in a semantic predicate in order to disambiguate a parse by rejecting the production it is part of. You cannot do this in a semantic action because semantic actions are not executed until the parser has decided on a unique parse, at which point there is no longer an ambiguity.

Semantic predicates can occur anywhere in a rule, and will be immediately executed when they are reduced, even if the reduction is ambiguous. Because they are executed out of sequence, they cannot refer to the value produced by any preceding semantic action (and semantic predicates themselves have no semantic values even though they occupy a stack slot). Also, because they are effectively executed speculatively as part of a production which might not be part of the final parse, they should not mutate the parser state.
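As a minimal sketch of the "should not mutate" point: if the symbol table happens to be a std::map (an assumption; the question does not say what symbol_table is), then looking a name up with operator[] inserts a default entry for unknown names, which is exactly the kind of side effect a speculatively executed check should avoid. The names SymbolTable, lookup and lookup_or_insert below are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the question's symbol table; the real one may differ.
using SymbolTable = std::map<std::string, int>;

// Read-only lookup: returns nullptr when the name is unknown and never
// adds an entry, so it is safe to call from speculatively executed code.
const int* lookup(const SymbolTable& table, const std::string& name) {
    auto it = table.find(name);
    return it == table.end() ? nullptr : &it->second;
}

// Mutating lookup, shown only for contrast: std::map::operator[]
// value-initializes and inserts an entry when the name is missing.
int& lookup_or_insert(SymbolTable& table, const std::string& name) {
    return table[name];
}

// Small self-check illustrating the difference between the two lookups.
void demo_lookup_side_effects() {
    SymbolTable table{{"x", 1}};
    assert(lookup(table, "missing") == nullptr);
    assert(table.size() == 1);  // find() left the table untouched
    lookup_or_insert(table, "missing");
    assert(table.size() == 2);  // operator[] added an entry
}
```

With a table of this shape, a check built on lookup leaves the table identical no matter how many abandoned parses evaluated it.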

The semantic predicate does have access to the lookahead token, if there is one, through the variable yychar. If yychar is neither YYEMPTY nor YYEOF, the associated semantic value and location information are in yylval and yylloc, respectively (see Lookahead Tokens). Care must be taken before using this feature, though; if there is no ambiguity at that point in the grammar, it is likely that the bison parser will have executed the semantic predicate without reading the lookahead token.

So your understanding was close, but not quite correct:

1. For efficiency, the GLR parser recognizes when there is no ambiguity in the parse, and uses the normal straightforward shift-reduce algorithm at those points. In most grammars, ambiguities are rare and are resolved rapidly, so this optimisation means that the overhead associated with maintaining multiple alternatives can be avoided for the majority of the parse. In this mode (which won't apply if ambiguity is possible, of course), YYERROR leads to standard error recovery, as in a shift-reduce parser, at the point of reduction of the action.
2. Since deferred actions are not run until ambiguity has been resolved, YYERROR in a deferred action will effectively be executed as though it were in a deterministic state, as in the previous paragraph. This is true even if there is a custom merge procedure; if a custom merge is in progress, all the alternatives participating in the custom merge will be discarded. Afterwards, the normal error recovery algorithm will continue. I don't think that the error recovery point is "random", but predicting where it will occur is not easy.
3. As indicated above, semantic predicates can appear anywhere in a rule, and have access to the lookahead token if there is one. (The Bison manual is a bit misleading here: a semantic predicate cannot contain an invocation of YYERROR because YYERROR is a statement, and the contents of the semantic predicate block is an expression. If the value of the semantic predicate is false, then the parser performs the YYERROR action.)

In effect, that means that you could use semantic predicates instead of (or as well as) actions, but basically in the way that you propose:

arithmeticexpression
    : TOK_IDENTIFIER %?{
          dynamic_cast<IntVariable*>(symbol_table[*$1])
      }
;
stringexpression
    : TOK_IDENTIFIER %?{
          dynamic_cast<StringVariable*>(symbol_table[*$1])
      }
;

Note: I removed the type declarations from $<stringvalue>1 because:

1. It's basically unreadable, at least to me,
2. More importantly, like an explicit cast, it is not typesafe.

You should declare the semantic type of your tokens:

%token <stringvalue> TOK_IDENTIFIER

which will allow bison to type-check.
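The type tests used in the predicates above can also be factored into small C++ helpers, which keeps the grammar readable and makes it easy to accept both int and float variables in the arithmetic case, as the question requires. This is only a sketch under assumptions: IntVariable and StringVariable come from the question, while Variable, FloatVariable, the SymbolTable alias and the helper names are hypothetical, and the real symbol table may be shaped differently.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// IntVariable and StringVariable are named in the question; Variable,
// FloatVariable and SymbolTable are assumptions made for this sketch.
struct Variable { virtual ~Variable() = default; };
struct IntVariable : Variable {};
struct FloatVariable : Variable {};
struct StringVariable : Variable {};

using SymbolTable = std::map<std::string, std::unique_ptr<Variable>>;

// Read-only lookup; returns nullptr for an undeclared name.
const Variable* find_variable(const SymbolTable& table, const std::string& name) {
    auto it = table.find(name);
    return it == table.end() ? nullptr : it->second.get();
}

// True for int- or float-typed variables (the arithmeticexpression case).
// dynamic_cast on a null pointer yields null, so undeclared names give false.
bool is_arithmetic_variable(const SymbolTable& table, const std::string& name) {
    const Variable* v = find_variable(table, name);
    return dynamic_cast<const IntVariable*>(v) != nullptr
        || dynamic_cast<const FloatVariable*>(v) != nullptr;
}

// True for string-typed variables (the stringexpression case).
bool is_string_variable(const SymbolTable& table, const std::string& name) {
    return dynamic_cast<const StringVariable*>(find_variable(table, name)) != nullptr;
}

// Small self-check: each declared name satisfies exactly one helper.
void demo_variable_kinds() {
    SymbolTable table;
    table["n"] = std::make_unique<IntVariable>();
    table["s"] = std::make_unique<StringVariable>();
    assert(is_arithmetic_variable(table, "n") && !is_string_variable(table, "n"));
    assert(is_string_variable(table, "s") && !is_arithmetic_variable(table, "s"));
}
```

A predicate body would then shrink to something like is_arithmetic_variable(symbol_table, *$1). For an undeclared identifier both helpers return false, so both alternatives would be rejected and the input would presumably be reported as a syntax error rather than as an "undeclared variable" error.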

Whether or not using semantic predicates for that purpose is a good idea is another question.
