Berkeley Yacc versus GNU Bison: different tolerance w.r.t trailing tokens

Question

I'm running into a parser portability problem: different behavior when translated using byacc versus bison --yacc.

The Bison-generated parser allows me to call yyparse to extract just some prefix of the input token sequence which can be derived from the start symbol by the grammar rules. This is because Bison's generated parser has a $default reduce action. This, I think, means that "for any lookahead tokens which are not mentioned by other actions (because they don't match the grammar), perform this reduce".

By contrast, for the same state and set of rules, Berkeley Yacc does not have such a default reduce rule. The same reduction is keyed to matching the $end symbol specifically. In other words, in the the Berkeley-Yacc-generated parser, the rule is effectively "anchored" to the end, like a regex with a $ on it.

(Note: the Bison-generated parser still anchors the grammar as a whole to the end, but it it does this by matching on $end only in the top-level rule for the start symbol; it does not proliferate $end matching into the subordinate rules!)

The difference matters because I call yyparse more than once to extract successive phrasal units. This works with Bison, because I YYACCEPT before it has a chance to reduce to the start symbol where the implicit $end token is required. (Well, it doesn't exactly work "out of the box": but with a certain trick it be made to work). With Berkeley Yacc, the syntax error due to not seeing the $end in the subordinate rules which derive the start symbol means that the whole scheme is dead in the water.

Is there some way to get Berkeley Yacc to do a default reduce for lookahead token values which don't match the grammar-defined continuation or termination of the syntax? In other words, to populate all the unused entries in the LALR(1) table for that state with default reduces?

Idea: I'm thinking that perhaps the phrasal unit being extracted can be embroiled into a repetition rule. Instead of trying to parse out a single expr, I can parse a "sequence of expr" via such a newly introduced shim rule; but then immediately YYACCEPT after getting just one, along the lines of:

start : exprs { $$ = $1; };

exprs : expr { $$ = $1; YYACCEPT; /* Got just the one expr I really wanted! */ } 
      | exprs expr { /* Never reached! Byacc fooled! */ }
      ;

I will try it this way, but it would still be good to know why there is such a gaping difference between two Yacc implementations, and how to overcome it directly.

Edit: a hack which works in my project is along the lines of this pseudo-code:

start : expr { YYACCEPT; } byacc_fool { /*notreached*/ abort(); }

byacc_fool : expr { abort(); }
           | /*nothing*/ { abort(); }
           ;

All regression tests pass with Byacc or Bison.

None of the dummy actions are reached; they serve the purpose of creating a grammar rule which allows expr to be followed by another expr. But this is achieved in such a way that it won't actually consume the second expr; at the point of the YYACCEPT invocation, just one lookahead token is consumed from any following expr. (I have a solution in place to restore that token before each successive yyparse call; and that hack now works under Byacc.)

I still have a feeling like there is something simple I'm not seeing that everyone else knows.

I would regard the bison behaviour as non-standard. The goal symbol is implicitly followed by EOF, in every arser generator I have ever used since 1979. I suggest what you should really be doing is adjusting your architecture to call yyparse() once, and execute the per-item actions in the production(s) concerned, which after all is the whole idea. — user207421
@EJP "EJP: The goal symbol is implicitly followed by EOF ..." But this is in fact the case for the root production for the start symbol! Just not in the subordinate rules. Bison will reduce to the top rule, and catch it there. This gives us an opportunity to YYACCEPT inside a rule before that happens. I will edit the question to clarify this. — Kaz
@EJP suggest ... adjusting your architecture to call yyparse() once. I'm afraid that's only possible in toy languages. The actions are to construct an abstract syntax tree which must be returned; this implements a "parse expression" function in a programming language. Having control sit inside yyparse is unacceptable, because it changes the API. The caller can't just call read and retrieve the next object parsed from the stream. The caller is a program not necessarily under my control, using my published API. — Kaz
Another issue is that I disable GC when calling yyparse. The reason is that some of the YYSTYPE stack symbols contain heap references. The garbage collector doesn't know where this Yacc stack is and so it cannot walk it. Disabling GC for too long is bad. — Kaz
If you can call yyparse() twice, the implication is that the parser has stopped before shifting EOF. — user207421

akim akim · Accepted Answer · 2015-08-18T09:50:28

FWIW, I think the analysis is slightly incorrect, and is misleading on $default.

$default is a compression mechanism: instead of a long list of lookaheads for which the reduction is the same, let's perform this reduction for all the remaining lookaheads, including those that were not valid. This does not allow the parser to accept more sentences; it can be proved that the error will be caught in some other rule. It does not either allow the parser to accept a longer prefix: the error will be caught at exactly the same point; however maybe a few more reductions will be performed. In exchange, your tables are smaller, since you have fewer cases to encode. Today it might be a useless optimization, but back then space was really critical.

I believe that both Bison and Byacc implement this technique (actually Bison and Byacc are a fork of the same program and still share quite a lot of code).

However, at some point in the history of Bison, the initial rule was added: $accept: exp $end, where exp denotes the user's start symbol. Before the handling of this rule was hardcoded in Bison (the generator) and the generated parsers, and even the tables were hand tuned to deal with exp and $end without using this "rule 0". I changed that, because (i) the theory of LR parsers does use this rule, so it was a problem when teaching theory to have Bison display different tables, and (ii) the code was much simpler, shorter, with this rule 0.

As an unexpected consequence (that change was done many many years ago, this is the first time I hear that it is observable), the $end token is treated as the other, and is subject to being fused within a $default action. This is were you can tell a difference between Byacc and Bison I guess.

That being said, I don't have another workaround to propose than yours :) Except, if you are ready to move to Bison only (which obviously is not the case), moving to a push-parser. This was done exactly to match your needs: interactive loops rather than batch parsing.

Berkeley Yacc versus GNU Bison: different tolerance w.r.t trailing tokens

1 Answers