1
votes

If I am reading a string with Flex with the following code:

     %x str

     %%
             char string_buf[MAX_STR_CONST];
             char *string_buf_ptr;


     \"      string_buf_ptr = string_buf; BEGIN(str);

     <str>\"        { /* saw closing quote - all done */
             BEGIN(INITIAL);
             *string_buf_ptr = '\0';
             /* return string constant token type and
              * value to parser
              */
             }

     <str>\n        {
             /* error - unterminated string constant */
             /* generate error message */
             }

     <str>\\[0-7]{1,3} {
             /* octal escape sequence */
             int result;

             (void) sscanf( yytext + 1, "%o", &result );

             if ( result > 0xff )
                     { /* error, constant is out-of-bounds */ }

             *string_buf_ptr++ = result;
             }

     <str>\\[0-9]+ {
             /* generate error - bad escape sequence; something
              * like '\48' or '\0777777'
              */
             }

     <str>\\n  *string_buf_ptr++ = '\n';
     <str>\\t  *string_buf_ptr++ = '\t';
     <str>\\r  *string_buf_ptr++ = '\r';
     <str>\\b  *string_buf_ptr++ = '\b';
     <str>\\f  *string_buf_ptr++ = '\f';

     <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

     <str>[^\\\n\"]+        {
             char *yptr = yytext;

             while ( *yptr )
                     *string_buf_ptr++ = *yptr++;
             }

This is taken from here and here. Flex only gives me yytext, or one of the characters in yytext, but my string is stored in 'string_buf_ptr'. How do I retrieve it in the code that calls the scanner? As has been pointed out here, modifying yytext beyond the current token can cause complications. So what is the right way to return this string to a simple driver that does only this:

ntoken = yylex();

while(ntoken) {
   printf("%s\n", yytext);
   ntoken = yylex();
}

2 Answers

1
votes

There are lots of solutions to this problem, but the only really safe one is to allocate a new string buffer (with malloc, usually), fill it in, and pass it to the caller.

If the caller is a bison-generated parser, it will expect the "semantic value" of the token to be in the variable yylval, which is most likely a union. In traditional flex/bison scanner/parser arrangements, yylval will be a global. That works fine if you only have one scanner and one parser in the program -- and indeed, a huge number of language prototypes have been built in this fashion -- but it's a bit ugly when viewed from a modern programming perspective.

Fortunately, bison and flex have also evolved, and you can get rid of globals by telling flex to build a reentrant lexer and bison to build a pure parser. Moreover, you can provide the bison-bridge option to flex, which will cause it to create the correct calling prototype for yylex without much more work on your part.

One downside of allocating a new string buffer every time is that you need to free it in your parser. But the upside is that the returned string buffer does not get overwritten on the next call to yylex, which makes it possible to use with bison. (bison and many other parser generators assume that it is possible to read one (or more) tokens in advance.) In that case, you cannot rely on any static state of the lexer, because by the time the bison reduction takes place, the lexer has already been called again and discarded the previous state.
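To make the ownership point concrete, here is a minimal sketch in plain C. It is not flex output: toy_value, toy_lex, copy_token and scratch are hypothetical names, and toy_lex only stands in for a scanner action that has finished filling a reusable buffer. The value is handed back through a caller-supplied pointer, which is roughly the shape a bridged yylex has.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token value type; bison would normally generate YYSTYPE. */
typedef union {
    char *str;
} toy_value;

/* Scratch buffer reused on every call, like string_buf in the question. */
char scratch[64];

/* Copy the first len bytes of src into a fresh NUL-terminated heap string. */
char *copy_token(const char *src, size_t len)
{
    char *out = malloc(len + 1);
    if (!out) {
        fputs("Out of memory!\n", stderr);
        exit(2);
    }
    memcpy(out, src, len);
    out[len] = '\0';
    return out;
}

/* Stand-in for a scanner action: "scans" text into the scratch buffer
 * (truncating if it does not fit), then gives the caller its own copy
 * through val. Returns 1 as a made-up token code. */
int toy_lex(const char *text, toy_value *val)
{
    size_t len = strlen(text);
    if (len >= sizeof scratch)
        len = sizeof scratch - 1;
    memcpy(scratch, text, len);
    scratch[len] = '\0';
    val->str = copy_token(scratch, len);
    return 1;
}
```

If toy_lex is called twice in a row, as a parser reading a lookahead token would do, scratch only holds the second string, while each val->str still holds its own text. The consumer is then responsible for calling free on each copy.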

The other downside is that you need to maintain the buffer at the correct size. Fortunately, that's easy even if you aren't using C++ strings (which is the strategy I usually recommend to beginners). Here's a very simple buffer management strategy which is surprisingly efficient except on certain platforms whose names begin with W:

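#include <stdio.h>
#include <stdlib.h>

char*  bf = 0;    /* buffer being built */
size_t bfsz = 0;  /* characters stored, not counting the NUL */
#define PUTCHAR(ch) do {                  \
  /* one byte for ch, one for the NUL */  \
  char* newbf = realloc(bf, bfsz + 2);    \
  if (!newbf) {                           \
    fputs("Out of memory!\n", stderr);    \
    exit(2);                              \
  }                                       \
  bf = newbf;                             \
  bf[bfsz++] = (ch);                      \
  bf[bfsz] = 0;                           \
} while(0)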

That relies on realloc rounding the allocation up to some exponentially increasing size and quickly doing nothing if there is already enough space. Many realloc implementations work in just that fashion. If realloc is not inlined, that code makes a lot of extra calls, which slows it down a bit. But it's good enough for quick hacks and you can always improve it later.
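One possible later improvement, sketched here rather than prescribed, is to track the capacity yourself and double it only when the buffer is full, so that most appends never call realloc at all. The names strbuf, strbuf_push and strbuf_take are made up for this example, and the starting capacity of 16 is an arbitrary choice.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable string buffer; the names are illustrative only. */
typedef struct {
    char  *data;  /* NUL-terminated contents, or NULL before the first push */
    size_t len;   /* characters stored, not counting the NUL */
    size_t cap;   /* bytes currently allocated */
} strbuf;

/* Append one character, doubling the allocation only when it is full. */
void strbuf_push(strbuf *sb, char ch)
{
    if (sb->len + 2 > sb->cap) {          /* need room for ch plus the NUL */
        size_t newcap = sb->cap ? sb->cap * 2 : 16;
        char *p = realloc(sb->data, newcap);
        if (!p) {
            fputs("Out of memory!\n", stderr);
            exit(2);
        }
        sb->data = p;
        sb->cap = newcap;
    }
    sb->data[sb->len++] = ch;
    sb->data[sb->len] = '\0';
}

/* Hand the finished string to the caller and reset the buffer for reuse.
 * Returns NULL if nothing was pushed, so an empty string literal needs
 * separate handling by the caller. */
char *strbuf_take(strbuf *sb)
{
    char *out = sb->data;
    sb->data = NULL;
    sb->len = 0;
    sb->cap = 0;
    return out;
}
```

In a scanner, the rules that currently write through string_buf_ptr would call strbuf_push instead, and the closing-quote rule would pass the result of strbuf_take to the caller, who frees it when done.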

0
votes

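If you declare the string buffer as a global, it will be accessible to all of your code without further modification. Note that it is the buffer itself (string_buf in your code) that needs to be shared: string_buf_ptr is only the write cursor, and it points at the end of the string once the closing quote has been seen. Probably you'd want to put the declaration in a header, e.g. "myglobals.h", as something like

extern char string_buf[1024];

and include that header file in your flex file (as well as in any other code files where the code needs to access the buffer), removing the local declaration of string_buf from the rules section. Then define it at file scope in exactly one source file, e.g.

char string_buf[1024];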

A potentially better way to do it would be to pass the memory buffer to flex without using a global variable. You can do this using yyextra (for details, see the Flex manual). The basic approach would be something like:

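Create a structure such as

struct mystruct {
  char string_buf[1024]; /* or you can malloc this before calling flex */
};

Then, before you call yylex, you can do something like the following. Note that yyextra is only available in a reentrant scanner, so the flex file needs %option reentrant:

int main(void) {
  ...
  struct mystruct lex_data;
  yyscan_t yyscanner_pointer;
  memset(&lex_data, 0, sizeof(lex_data));
  yylex_init_extra(&lex_data, &yyscanner_pointer);
  ...
  yylex(yyscanner_pointer);
  yylex_destroy(yyscanner_pointer);
}

Then, in your lex code, drop the local string_buf array and start each string by pointing string_buf_ptr at the shared buffer instead:

string_buf_ptr = ((struct mystruct *)yyextra)->string_buf;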

Feel free to comment if either of those approaches doesn't work.