3
votes

As part of a toy project I've been trying to make a small modification of someone else's parser based on flex/bison. I'm really not experienced with either. You can find the original parser here.

I've been trying to put together a simple function that accepts a string and returns a parse tree, so I can expose this via FFI for use in another programming language. What I have is mostly based on the main() function in the original program, my butchered version is below:

TreeNode* parse_string(char *s)
{
    FILE *in = fmemopen(s, strlen(s), "r");
    lex2_initialise();
    parse_file(in);
    fclose(in);
    preprocess_tokens();
    yyparse();
    return top;
}

This actually works fine, at least the first time I call it. The second time it complains about misparsed tokens, and the error reporting function used appears to be called from somewhere inside a maze of goto statements within the generated parser during the call to yyparse(), at which point I don't understand what's going on anymore.

The original program itself only appears to be designed to take all its input upfront and then exit, so it doesn't leave me with much clue of what I'm missing. Putting aside the not-altogether-outlandish idea some old state is being retained elsewhere in the rest of the program, my main questions are:

  • Do either Flex or Bison maintain global state between calls to yyparse()
  • Is there some simple function call I could put at the end of the function above to wipe it all and reset everything back to the initial state?
2

2 Answers

4
votes

Even if you are using a non-reentrant parser, you can use yylex_destroy (without arguments) after lexing to force an initialisation, the next time the the lexer is invoked:

extern int yylex_destroy(void);
...
// do parsing here
...
yylex_destroy()

For reentrant parsers see here.

3
votes

Do either Flex or Bison maintain global state between calls to yyparse()

Flex maintains information about the current input stream. If the parse does not consume the entire input stream (which is quite common for parsers which terminate abnormally on errors), then the next call to yyparse will continue reading from where the previous one left off. Providing a new input buffer will (mostly) reset the lexer's state, but there may be some aspects which have not been reset, notably the current start condition, and the condition stack if that option has been enabled.

The bison-generated parser does not rely on global state. It is designed to clear its internal state prior to returning from yyparse. However, if a parser action executes a return statement directly (this is not recommended), then the cleanup will be bypassed, which is likely to create a memory leak. Actions which prematurely terminate the parse should use the macros YYACCEPT or YYABORT rather than a return statement.

Is there some simple function call I could put at the end of the function above to wipe it all and reset everything back to the initial state?

The default flex-generated parser, which is designed to be called every time a token is required, is heavily reliant on global variables. Most, but not all, of the flex state is maintained in the current YY_BUFFER_STATE (which is kept in a global variable), and that object can be reset by the yyreset function, or any of the functions which provide a character buffer as lexer input. However, these functions do not reset the start condition nor do they flush the condition stack (if enabled), or the buffer stack. If you want to reset the state completely, you need to flush the stacks manually, and reset the start condition with BEGIN(INITIAL).

One approach to making a more easily restartable scanner is to build a reentrant scanner. A reentrant scanner keeps all of its state (including start conditions and buffer stack) in a scanner structure, which means that you can completely reset the scanner state simply by creating a new scanner structure (and, of course, destroying the old one to avoid leaking memory.)

There are lots of good reasons to use reentrant scanners [Note 1]. For one thing, it allows you to have more than one parser active at the same time, and it eliminates a reliance on global state. But unfortunately, it's not as simple as just setting a flex options.

Reentrant scanners have a different API (which includes a pointer to the scanner state structure). This state structure needs to be passed into yyparse and yyparse needs to pass it to yylex; all of this requires some modifications to the bison options. Also, reentrant scanners cannot use the global yylval to communicate the semantic value of a token to the parser [Note 2].

If you use the %bison-bridge option and tell bison to generate a reentrant parser, then yylex will expect to be called with another additional parameter (or two, if you use locations), and the reentrant bison parser will supply the additional parameters. That all works fine, but it has the effect of changing yylval (and yylloc, if used) to a pointer, which means that you need to go through all the scanner actions changing yylval.something to yylval->something.

Notes

  1. You can also create a reentrant parser, using some additional bison options. Normally, the only mutable globals used by a bison-generated parser are yylval and yylloc (if you use location reporting). (And yynerrs, but it is rare to refer to that variable outside of a parser action.) Specifying a reentrant parser turns those globals into lexer arguments, but it does not create an externally visible parser state structure. But it also gives you the option of using a "push parser", which does have a persistent parser state structure. In some cases, the flexibility of push parsers can significantly simplify scanners.

  2. Strictly speaking, nothing stops you from creating a reentrant scanner which still uses globals to communicate with the parser, except that it is not really reentrant any more. I wouldn't recommend this option for obvious reasons, but you might want to do it as a transitional strategy, since it requires less modification to the parser and to scanner actions.