
While trying to recreate Python's indentation-defined blocks, I ran into a problem right at the start.

When I run my lexer/scanner on its own, it returns the expected results and correctly uses the start conditions I defined. But when I couple it with the Bison parser, the right state is not kept and I receive tokens from an unexpected state.

The behavior I expect is an "INDENT" token for each tab or space at the beginning of a line and, once any other symbol (not a tab or space) is found, an "OTHER" token for every symbol until a new line starts.

First case, lexer returning expected results

scanner.l

%{
  #include <iostream>
%}

%option noyywrap

%x INDENT
%%

  BEGIN(INDENT);

<INDENT>[ \t] { std::cout << "INDENT "; }
<INDENT>.|\n { yyless(0); BEGIN(INITIAL); }

\n { std::cout << std::endl; BEGIN(INDENT); }
. { std::cout << "OTHER "; }

%%

int main(){
  yylex();
  return 0;
}

Entering "  test  " (two spaces before and after "test") returns "INDENT INDENT OTHER OTHER OTHER OTHER OTHER OTHER".

Second case, parser returning unexpected results

scanner.l

%{
  #include <iostream>

  #include "parser.h"
%}

%option noyywrap

%x INDENT
%%

  BEGIN(INDENT);

<INDENT>[ \t] { return T_INDENT; }
<INDENT>.|\n { yyless(0); BEGIN(INITIAL); }

\n { BEGIN(INDENT); return T_NEWLINE; }
. { return T_OTHER; }

%%

parser.y

%{
  #include <iostream>

  extern int yylex();

  void yyerror(const char *s);
%}

%define parse.error verbose

%token T_INDENT T_OTHER T_NEWLINE

%%

program : program symbol
        | %empty
        ;

symbol : T_INDENT { std::cout << "INDENT "; }
       | T_NEWLINE { std::cout << std::endl; }
       | T_OTHER { std::cout << "OTHER "; }
       ;

%%

void yyerror(const char *s){
  std::cout << s;
}

int main(){
  yyparse();
  return 0;
}

Entering "  test  " (same as before) returns "INDENT INDENT OTHER OTHER OTHER OTHER INDENT INDENT", whereas the expected result was the same as above.

The Bison parser seems to be receiving the wrong tokens, as if the start conditions were not being respected. I've read that the parser can interfere with start conditions because of its lookahead behavior, but I'm not sure whether that is the problem here or how I would counter it.

Comment (user207421): This is a very strange way to do it. I would have ^[ \t] BEGIN INDENT;, not start in that state.

1 Answer


Because you have

BEGIN(INDENT);

in the rules section with no pattern in front of it, flex copies it verbatim to the top of the yylex function, so it runs on every call to yylex. Each time Bison calls yylex to get a new token, the start condition is reset to INDENT, which is why you get T_INDENT tokens for the trailing spaces.

In your "First Case" example, the lexer doesn't return until EOF, so yylex is called only once and sets the INDENT state only once.

If you want this code to run only on the first call to yylex, guard it with a static flag. Something like:

        {
          /* Still executed on every yylex() call; the static flag limits BEGIN to the first one. */
          static bool started = false;
          if (!started) {
            BEGIN(INDENT);
            started = true;
          }
        }
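
The difference between running BEGIN(INDENT) on every call and running it only once can be modeled in plain C++. This is only a rough hand-written sketch of the control flow, not flex-generated code; the names ModelScanner, next_token, and scan_all are made up for illustration, and the token names are returned as strings for simplicity.

```cpp
#include <cstddef>
#include <string>

// Hand-written model of the scanner's start-condition logic (hypothetical names).
enum class State { Initial, Indent };

struct ModelScanner {
    std::string input;
    bool reset_every_call = false;  // true models the unguarded BEGIN(INDENT)
    std::size_t pos = 0;
    State state = State::Initial;   // persists between calls, like flex's start condition
    bool first_call = true;

    // Models one yylex() call: returns a token name, or "" at end of input.
    std::string next_token() {
        // Models the code copied to the top of yylex().
        if (reset_every_call || first_call) {
            state = State::Indent;
            first_call = false;
        }
        while (pos < input.size()) {
            char c = input[pos];
            if (state == State::Indent) {
                if (c == ' ' || c == '\t') { ++pos; return "INDENT"; }
                state = State::Initial;  // models yyless(0); BEGIN(INITIAL);
                continue;                // rescan the same character
            }
            ++pos;
            if (c == '\n') { state = State::Indent; return "NEWLINE"; }
            return "OTHER";
        }
        return "";
    }
};

// Collects all token names for an input, separated by single spaces.
std::string scan_all(const std::string& input, bool reset_every_call) {
    ModelScanner s;
    s.input = input;
    s.reset_every_call = reset_every_call;
    std::string out;
    for (std::string tok = s.next_token(); !tok.empty(); tok = s.next_token()) {
        if (!out.empty()) out += ' ';
        out += tok;
    }
    return out;
}
```

With reset_every_call set to true, scan_all("  test  ", true) reproduces the sequence from the question's second case; with false, it yields the expected sequence from the first case, because the state set by an earlier call survives into the next one.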

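Alternatively, restructure the states so that INITIAL is the state the scanner should start in, for example by treating INITIAL as the start-of-line state and moving the rest-of-line rules into an exclusive state. Then no BEGIN is needed at the top of the rules section.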