Flex sets the start condition (YY_START) to INITIAL by default when a yyscan_t is initialized. I'm trying to make a reentrant scanner that can start in a user-specified start condition instead of INITIAL.

Here is the case:

/* comment start       //not passed into flex
   in comment          //first line passed into flex
   end of comment*/    //second line passed into flex

For some reason, these two lines are fed into the reentrant scanner separately, and the start condition each line belongs to is known. What I need is to pass the comment state into the reentrant scanner and switch the start condition to COMMENT before it starts lexing in comment\n. My workaround is to add a dummy token at the head of each line and pass the state to flex as yyextra. Once the dummy token is recognized, the scanner switches to the specified state, so flex begins lexing the rest of the line in the right start condition. However, adding a dummy token at the beginning of each line is time-consuming.

Here is how I call the reentrant scanner:

yyscan_t scanner;
YY_BUFFER_STATE buffer;
yylex_init(&scanner);
buffer = yy_scan_string(inputStr, scanner);
yyset_extra(someStructure, scanner);
yylex(scanner);
yy_delete_buffer(buffer, scanner);
yylex_destroy(scanner);

Is it possible to set the start condition before yylex(scanner) is called?

yylex does not reset the start condition when it is called. AFAIK, the only time the start condition is reset is when you create a new yyscan_t object. Are you doing that? - rici
@rici, you're right. I have edited my question. - Shiang Dza
Why don't you just keep on using the same yyscan_t? It's not sending an extra token that's slow: it's the overhead of creating and destroying scanner states. The scanner state is reusable without problems. - rici
I agree that reusing the same yyscan_t would save a lot of redundant init and destroy calls. In this case, though, the first line passed to the scanner is in comment\n, not /* comment start\n. Flex won't know this line is inside a comment unless we tell it. - Shiang Dza
How are you really calling yylex? Do you just call it once to lex the entire line, or do you call it for each token? And how do you save the current lexical state? - rici

1 Answer

If you are only calling yylex once for each input line, then you could just add an extra argument to yylex which provides the start condition to switch to, and set the start condition at the top of yylex.
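A minimal sketch of that extra-argument approach, assuming yylex really is called exactly once per line. The parameter name start_condition, the printf actions, and the idea of returning YY_START at end of input are all assumptions made for illustration; they are not part of the sample code in this answer.

```lex
%option reentrant noyywrap noinput nounput nodefault
%{
#include <stdio.h>
/* Assumed signature for this sketch. The scanner parameter must keep the
 * name yyscanner, because the generated code refers to it by that name. */
#define YY_DECL int yylex(int start_condition, yyscan_t yyscanner)
%}
%x IN_COMMENT
%%
  /* Executed at the top of every yylex call, so this is only safe when
   * yylex is called exactly once per line. */
  BEGIN(start_condition);
"/*"            BEGIN(IN_COMMENT);
[[:alnum:]]+    printf("word: %s\n", yytext);
[[:space:]]+    ;
.               printf("char: %c\n", yytext[0]);
<IN_COMMENT>{
  "*/"          BEGIN(INITIAL);
  .|[^*]+       ;
}
<INITIAL,IN_COMMENT><<EOF>>  return YY_START; /* final state, as a plain int */
```

The caller would keep the returned integer and pass it back in as start_condition for the next line, without ever needing the start condition names. Code that calls this yylex also needs a matching declaration, for example by defining YY_DECL the same way before including the generated header and then writing `YY_DECL;`.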

But there's no simple way to refer to start conditions from outside of the flex file, nor is there a convenient way to extract the current start condition from the yyscan_t object. The fact that you claim to have this information available suggests that you are storing it somewhere when you change start conditions, so you could restore the start condition from that same place when you start up yylex. The simplest place to store the information would be the yyextra object, so that's the basis of this sample code:

File begin.int.h

/* This is the internal header file, which defines the extra data structure
 * and, in this case, the tokens.
 */
#ifndef BEGIN_INT_H
#define BEGIN_INT_H

struct Extra {
  int start;
};

enum Tokens { WORD = 256 };

#endif

File begin.h

/* This is the external header, which includes the header produced by
 * flex. That header cannot itself be included in the flex-generated code,
 * and it depends on the internal header. So the order of includes here is
 * (sadly) important.
 */
#ifndef BEGIN_H_
#define BEGIN_H_

#include "begin.int.h"
#include "begin.lex.h"

#endif

File begin.l

/* Very simple lexer, whose only purpose is to drop comments. */
%option noinput nounput noyywrap nodefault 8bit
%option reentrant
%option extra-type="struct Extra*"
%{
#include "begin.int.h"
/* This macro ensures that start condition changes are saved */
#define MY_BEGIN(s) BEGIN(yyextra->start = (s))
%}

%x IN_COMMENT
%%
  /* See note below */
  BEGIN (yyextra->start);
"/*"          MY_BEGIN(IN_COMMENT);
[[:alnum:]]+  return WORD;
[[:space:]]+  ;
.             return yytext[0];

<IN_COMMENT>{
  "*/"        MY_BEGIN(INITIAL);
  .|[^*]+     ;
}

Note:

Any indented code after the first %% and before the first pattern is inserted at the beginning of yylex; the only thing that executes before it is the one-time initialization of the yyscan_t object, if necessary.
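Because that inserted code runs on every call to yylex, not just the first, each start condition change has to be written back to the extra object (the job of MY_BEGIN); otherwise the next call would undo it. The control flow can be sketched in plain C as follows. This is a hand-written stand-in, not flex output: struct Scanner, my_begin, next_word, count_words and the ST_* constants are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for the flex start conditions INITIAL and IN_COMMENT. */
enum { ST_INITIAL = 0, ST_IN_COMMENT = 1 };

/* Same role as struct Extra in begin.int.h: state that outlives a scanner. */
struct Extra { int start; };

/* Stand-in for a yyscan_t: created per line, so `state` always begins at 0. */
struct Scanner {
  const char *pos;      /* next unread character */
  int state;            /* current "start condition" */
  struct Extra *extra;  /* persistent data, like yyextra */
};

/* Like MY_BEGIN: every state change is also recorded in the extra object. */
static void my_begin(struct Scanner *s, int state) {
  s->state = s->extra->start = state;
}

/* Stand-in for yylex: returns 1 after consuming a word, 0 at end of input. */
static int next_word(struct Scanner *s) {
  /* Like the indented BEGIN(yyextra->start): runs on every call. */
  s->state = s->extra->start;
  while (*s->pos != '\0') {
    if (s->state == ST_IN_COMMENT) {
      if (s->pos[0] == '*' && s->pos[1] == '/') {
        s->pos += 2;
        my_begin(s, ST_INITIAL);
      } else {
        s->pos++;
      }
    } else if (s->pos[0] == '/' && s->pos[1] == '*') {
      s->pos += 2;
      my_begin(s, ST_IN_COMMENT);
    } else if (isalnum((unsigned char)*s->pos)) {
      while (isalnum((unsigned char)*s->pos)) s->pos++;
      return 1;
    } else {
      s->pos++;  /* whitespace or punctuation: skipped in this sketch */
    }
  }
  return 0;
}

/* Scans one line with a fresh scanner and returns the number of words
 * found outside comments. */
int count_words(const char *line, struct Extra *extra) {
  assert(line != NULL && extra != NULL);
  struct Scanner s = { line, ST_INITIAL, extra };
  int words = 0;
  while (next_word(&s)) words++;
  return words;
}
```

Calling count_words on "foo /* bar" and then on "baz */ qux" with the same struct Extra finds one word in each line: the second scanner starts from scratch, but picks up the comment state from the extra object before it reads its first character.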

File begin.main.c

/* Simple driver which creates and destroys a scanner object for every line
 * of input. Note, however, that it reuses the extra data object, which holds
 * persistent information (in this case, the current start condition).
 */
#include <stdio.h>
#include <stdlib.h>  /* realloc, free */
#include "begin.h"

int main ( int argc, char * argv[] ) {
  char* buffer = NULL;
  size_t buflen = 0;

  struct Extra my_extra = {0};

  for (;;) {
    ssize_t nr = getline(&buffer, &buflen, stdin);
    if (nr < 0) break;
    if (nr == 0) continue; 

    yyscan_t scanner;
    yylex_init_extra(&my_extra, &scanner);

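    /* yy_scan_buffer requires the last two bytes of the buffer to be NUL.
     * getline wrote one NUL at buffer[nr]; make room for the second. */
    size_t needed = (size_t)nr + 2;
    if (buflen < needed) {
      char* grown = realloc(buffer, needed);
      if (grown == NULL) { perror("realloc"); return 1; }
      buffer = grown;
      buflen = needed;
    }
    buffer[nr + 1] = 0;
    YY_BUFFER_STATE b = yy_scan_buffer(buffer, needed, scanner);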

    for (;;) {
      int token = yylex(scanner);
      if (token == 0) break;
      printf("%d: '%s'\n", token, yyget_text(scanner));
    }
    yy_delete_buffer(b, scanner);
    yylex_destroy(scanner);
  }
  free(buffer);  /* getline allocated this buffer */
  return 0;
}

Build:

flex -o begin.lex.c --header-file begin.lex.h begin.l
gcc -Wall -ggdb -o begin begin.lex.c begin.main.c