I'm running into a parser portability problem: different behavior when translated using byacc
versus bison --yacc
.
The Bison-generated parser allows me to call yyparse
to extract just some prefix of the input token sequence which can be derived from the start symbol by the grammar rules. This is because Bison's generated parser has a $default
reduce action. This, I think, means that "for any lookahead tokens which are not mentioned by other actions (because they don't match the grammar), perform this reduce".
By contrast, for the same state and set of rules, Berkeley Yacc does not have such a default reduce rule. The same reduction is keyed to matching the $end
symbol specifically. In other words, in the the Berkeley-Yacc-generated parser, the rule is effectively "anchored" to the end, like a regex with a $ on it.
(Note: the Bison-generated parser still anchors the grammar as a whole to the end, but it it does this by matching on $end
only in the top-level rule for the start symbol; it does not proliferate $end
matching into the subordinate rules!)
The difference matters because I call yyparse
more than once to extract successive phrasal units. This works with Bison, because I YYACCEPT
before it has a chance to reduce to the start symbol where the implicit $end
token is required. (Well, it doesn't exactly work "out of the box": but with a certain trick it be made to work). With Berkeley Yacc, the syntax error due to not seeing the $end
in the subordinate rules which derive the start symbol means that the whole scheme is dead in the water.
Is there some way to get Berkeley Yacc to do a default reduce for lookahead token values which don't match the grammar-defined continuation or termination of the syntax? In other words, to populate all the unused entries in the LALR(1) table for that state with default reduces?
Idea: I'm thinking that perhaps the phrasal unit being extracted can be embroiled into a repetition rule. Instead of trying to parse out a single expr
, I can parse a "sequence of expr
" via such a newly introduced shim rule; but then immediately YYACCEPT
after getting just one, along the lines of:
start : exprs { $$ = $1; };
exprs : expr { $$ = $1; YYACCEPT; /* Got just the one expr I really wanted! */ }
| exprs expr { /* Never reached! Byacc fooled! */ }
;
I will try it this way, but it would still be good to know why there is such a gaping difference between two Yacc implementations, and how to overcome it directly.
Edit: a hack which works in my project is along the lines of this pseudo-code:
start : expr { YYACCEPT; } byacc_fool { /*notreached*/ abort(); }
byacc_fool : expr { abort(); }
| /*nothing*/ { abort(); }
;
All regression tests pass with Byacc or Bison.
None of the dummy actions are reached; they serve the purpose of creating a grammar rule which allows expr
to be followed by another expr
. But this is achieved in such a way that it won't actually consume the second expr
; at the point of the YYACCEPT invocation, just one lookahead token is consumed from any following expr. (I have a solution in place to restore that token before each successive yyparse
call; and that hack now works under Byacc.)
I still have a feeling like there is something simple I'm not seeing that everyone else knows.
bison
behaviour as non-standard. The goal symbol is implicitly followed by EOF, in every arser generator I have ever used since 1979. I suggest what you should really be doing is adjusting your architecture to callyyparse()
once, and execute the per-item actions in the production(s) concerned, which after all is the whole idea. – user207421YYACCEPT
inside a rule before that happens. I will edit the question to clarify this. – Kazyyparse
is unacceptable, because it changes the API. The caller can't just callread
and retrieve the next object parsed from the stream. The caller is a program not necessarily under my control, using my published API. – Kazyyparse
. The reason is that some of theYYSTYPE
stack symbols contain heap references. The garbage collector doesn't know where this Yacc stack is and so it cannot walk it. Disabling GC for too long is bad. – Kazyyparse()
twice, the implication is that the parser has stopped before shifting EOF. – user207421