1
votes

Is it possible to change lexical state (aka "start condition") from within the grammar rules of Jison?

Am parsing a computer language where lexical state clearly changes (at least to my human mindset) when certain grammar rules are satisfied, even though there's no token I can exactly point to in the lexer.

(The reason I think this is that certain keywords are reserved/reservable in one state but not the other.)

It's definitely possible to change the lexical state from within the lexer, e.g.:

%lex
%x expression

%% 
{id}               {  return 'ID';
"="                { this.begin('expression'); return '='; }
<expression>";"    { this.popState(); return ';'; }

But is there a way to change lexical state when certain grammar rules are matched?

%% /* language grammar */

something :  pattern1 pattern2 { this.beginState('expression'); $$ = [$1,$2]; };
pattern1 : some stuff { $$ = [$1, $2]; }
pattern2 : other stuff { $$ = [$1, $2]; }

If I try this, I get

TypeError: this.popState is not a function
      at Object.anonymous (eval at createParser (/Users/me/Exp/stats/node_modules/jison/lib/jison.js:1327:23), <anonymous>:47:67)
      at Object.parse (eval at createParser (/Users/me/Exp/stats/node_modules/jison/lib/jison.js:1327:23), <anonymous>:329:36)

I'm not sure if what I'm asking for is theoretically impossible or conceptually naive (e.g. is this the very meaning of context free grammar?), or it's there and I'm just not reading the docs right.

1

1 Answers

5
votes

The lexer object is available in a parser action as yy.lexer, so you can change the start condition with yy.lexer.begin('expression'); and go back to the old one with yy.lexer.popState(). That part is not problematic.

However, you need to think about when the new start condition will take effect. An LALR(1) parser, such as the one implemented by jison (or bison), uses a single lookahead token to decide what action to take. (The "1" in LALR(1) is the length of the possible lookahead.) That means that when a parser action is executed -- when the rule it is attached to is reduced -- the next token has probably already been read.

This will not always be the case; both jison and bison will sometimes be able to do a reduction without using the lookahead token, in which case they will not yet have read it.

In short, a change to the lexer state in an action might take effect before the next token is read, but most of the time it will take effect when the second next token is read. Because of this ambiguity, it is usually best to make lexer state changes prior to a token which is not affected by the lexer state change.

Consider, for example, the standard calculator. The following example is adapted from the jison manual:

%lex
%%

\s+                   /* skip whitespace */
[0-9]+\b              yytext=parseInt(yytext); return 'NUMBER'
[*/+%()-]             return yytext[0]
<<EOF>>               return 'EOF'
.                     return 'INVALID'

/lex

%left '+' '-'
%left '*' '/' '%'
%left UMINUS

%start expressions

%% /* language grammar */

expressions: e EOF              {return $1;};

e   : e '+' e                   {$$ = $1+$3;}
    | e '-' e                   {$$ = $1-$3;}
    | e '*' e                   {$$ = $1*$3;}
    | e '/' e                   {$$ = $1/$3;}
    | e '%' e                   {$$ = $1%$3;}
    | '-' e %prec UMINUS        {$$ = -$2;}
    | '(' e ')'                 {$$ = $2;}
    | NUMBER                    {$$ = $1;}
    ;

Now, let's modify it so that between [ and ] all numbers are interpreted as hexadecimal. We use a non-exclusive start condition called HEX; when it is enabled, hexadecimal numbers are recognized and converted accordingly.

%lex
%s HEX
%%

\s+                   /* skip whitespace */
<INITIAL>[0-9]+("."[0-9]+)?\b  yytext=parseInt(yytext); return 'NUMBER'
<HEX>[0-9a-fA-F]+\b            yytext=parseInt(yytext, 16); return 'NUMBER'
[*/+%()[\]-]          return yytext[0]
<<EOF>>               return 'EOF'
.                     return 'INVALID'

/lex

%left '+' '-'
%left '*' '/' '%'
%left UMINUS

%start expressions

%% /* language grammar */

expressions: e EOF              {return $1;};

e   : e '+' e                   {$$ = $1+$3;}
    | e '-' e                   {$$ = $1-$3;}
    | e '*' e                   {$$ = $1*$3;}
    | e '/' e                   {$$ = $1/$3;}
    | e '%' e                   {$$ = $1%$3;}
    | '-' e %prec UMINUS        {$$ = -$2;}
    | '(' e ')'                 {$$ = $2;}
    | hex '[' e unhex ']'       {$$ = $3;}
    | NUMBER                    {$$ = $1;}
    ;
hex :                           { yy.lexer.begin('HEX'); } ;
unhex:                          { yy.lexer.popState(); } ;

Here, we use the empty non-terminals hex and unhex to change lexer state. (In bison, I would have used a mid-rule action, which is very similar, but jison doesn't seem to implement them.) The key is that the state changes are done before the [ and ] tokens, which are not affected by the state change. Consequently, it doesn't matter whether the state change takes place before or after the current lookahead token, since we don't need it to take effect until the second next token, which might be a number.

This grammar will correctly output 26 given the input [10+a]. If we move the hex marker non-terminal to be inside the brackets:

      /* NOT CORRECT */
    | '[' hex e unhex ']'       {$$ = $3;}

then the start condition change happens after the lookahead token, so that [10+a] produces 20.