Call flex yy_push_state() from bison parser

Question

Is it possible to call yy_push_state() from a bison generated parser? How can this be done?

context:
    /* empty */ { $$ = NULL; yy_push_state(SOME_STATE); }
;

rule:
    context operator STRING { create_expr($2, $3); }
;

I would like to be able to call yy_push_state() from the parser, and I would also like to know whether this is an acceptable practice. If not, what are the alternatives for telling the lexer that it should push a state?

In this specific case, only the parser knows when to push SOME_STATE.

rici · Accepted Answer · 2015-02-06T07:02:37

It's certainly possible, but there is a huge warning. It's not at all obvious to me why you feel you need to do that in the example you provide; there is probably an alternative, but it's impossible to provide any advice without knowing more details about the use case.

Here's the warning. In the example you provide, the state push is generated by a marker production; conceptually (and maybe even in practice) you could use a mid-rule action:

rule:                 { yy_push_state(SOME_STATE); }
      operator STRING { create_expr($2, $3); }

The state push will happen when the empty production is reduced; that may or may not occur before the first token of operator is read, but in most cases it will be afterwards. So if, for example, the intention were to change the lexer to recognize (or not recognize) context-specific operators, then it will likely fail.

bison normally reduces immediately (without a lookahead token) if the lookahead token is completely unneeded at that point in the parse, but that behaviour is not guaranteed, and IMHO should not be relied upon. Other parsers (yacc, for example) don't do this; older bison versions didn't IIRC, and it's at least possible that different parser types (IELR, GLR) might have different views on whether a lookahead token is necessary.

So on the whole, it is better to be prepared for the likely case that a lookahead token has been read (which is why it is necessary to copy yytext, for example), while being careful to not make the assumption that it will have been read.
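If you want to see which case you are in, bison exposes the lookahead to actions through yychar, which is YYEMPTY when no lookahead token has been read. A minimal sketch, reusing the context marker from the question; the diagnostic message is only an illustration, and it assumes that <stdio.h> is included in the prologue and that yy_push_state() has been made callable from the parser:

```yacc
context:
    /* empty */ {
        $$ = NULL;
        /* yychar == YYEMPTY means bison reduced without reading a lookahead. */
        if (yychar != YYEMPTY)
            fprintf(stderr,
                    "note: token %d was scanned before SOME_STATE was pushed\n",
                    yychar);
        yy_push_state(SOME_STATE);
    }
;
```

A check like this is useful for finding out what your bison version does with a particular grammar, but the state change itself should still be correct in both branches.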

If your state change is robust enough, then go ahead and do the yy_push_state in the parser.
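As for the mechanics: in a default (non-reentrant) flex scanner, yy_push_state() is a static function that is only generated when %option stack is given, and the start-condition names are defined in the generated scanner, so the parser usually can't call it directly. A minimal sketch of one workaround, assuming a hypothetical wrapper named lexer_push_some_state() and hypothetical file names scanner.l and parser.y:

```lex
/* scanner.l (hypothetical file name) */
%option stack
%x SOME_STATE
%%
    /* ... token rules ... */
%%
/* User-code section: this is compiled as part of the scanner, so the
   static yy_push_state() and the SOME_STATE macro are both visible. */
void lexer_push_some_state(void) { yy_push_state(SOME_STATE); }
```

```yacc
/* parser.y (hypothetical file name) */
%{
void lexer_push_some_state(void);  /* defined in scanner.l */
%}
%%
context:
    /* empty */ { $$ = NULL; lexer_push_some_state(); }
;
```

Hiding the start condition behind a named function also keeps the parser from depending on the scanner's internal state numbers.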

For example, suppose that operator is not nullable and that the state change will change the rules by which STRING is recognized but will not have any effect on the lexical scan of any token which might appear in operator. In that case, the yy_push_state is safe: the push happens no later than just after the first token of operator has been read, which is still before STRING is scanned.

One place I've seen this hack attempted is trying to parse languages like awk and javascript where / might be a division operator or the beginning of a regex literal. In that case, it is possible to get the parser to change the lexical state in the regex case:

// Lexer
"/"  { return '/';
       /* No semantics, the parser will know what it means */
     }
<REGEX> {
   /* Lots of rules here. But unescaped / is just the same as above */
   "/"  { return '/';
          /* No semantics, the parser will know what it means */
        }
}

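// Parser
// (BEGIN is a macro private to the flex-generated scanner; in a real grammar
//  these actions would call small helper functions defined in the scanner file.)
expr: { BEGIN(REGEX); } '/' regex { BEGIN(INITIAL); } '/'
    | expr '/' expr
    | ...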

In the above case, the state change has no effect on how the lexer handles /, so if that slash is recognized as starting (or ending) a regex, the state change will take place either just before or (more likely) just after the / token has been scanned. This wouldn't have worked if the lexer had tried (unnecessarily, but it seems to be a temptation) to return different tokens for the two different uses of /; a good guideline is that the less the lexer knows about the semantics of tokens, the better.
