Do either Flex or Bison maintain global state between calls to yyparse()
Flex maintains information about the current input stream. If the parse does not consume the entire input stream (which is quite common for parsers which terminate abnormally on errors), then the next call to yyparse
will continue reading from where the previous one left off. Providing a new input buffer will (mostly) reset the lexer's state, but there may be some aspects which have not been reset, notably the current start condition, and the condition stack if that option has been enabled.
The bison-generated parser does not rely on global state. It is designed to clear its internal state prior to returning from yyparse
. However, if a parser action executes a return
statement directly (this is not recommended), then the cleanup will be bypassed, which is likely to create a memory leak. Actions which prematurely terminate the parse should use the macros YYACCEPT
or YYABORT
rather than a return
statement.
Is there some simple function call I could put at the end of the function above to wipe it all and reset everything back to the initial state?
The default flex-generated parser, which is designed to be called every time a token is required, is heavily reliant on global variables. Most, but not all, of the flex state is maintained in the current YY_BUFFER_STATE
(which is kept in a global variable), and that object can be reset by the yyreset
function, or any of the functions which provide a character buffer as lexer input. However, these functions do not reset the start condition nor do they flush the condition stack (if enabled), or the buffer stack. If you want to reset the state completely, you need to flush the stacks manually, and reset the start condition with BEGIN(INITIAL)
.
One approach to making a more easily restartable scanner is to build a reentrant scanner. A reentrant scanner keeps all of its state (including start conditions and buffer stack) in a scanner structure, which means that you can completely reset the scanner state simply by creating a new scanner structure (and, of course, destroying the old one to avoid leaking memory.)
There are lots of good reasons to use reentrant scanners [Note 1]. For one thing, it allows you to have more than one parser active at the same time, and it eliminates a reliance on global state. But unfortunately, it's not as simple as just setting a flex options.
Reentrant scanners have a different API (which includes a pointer to the scanner state structure). This state structure needs to be passed into yyparse
and yyparse
needs to pass it to yylex
; all of this requires some modifications to the bison options. Also, reentrant scanners cannot use the global yylval
to communicate the semantic value of a token to the parser [Note 2].
If you use the %bison-bridge
option and tell bison to generate a reentrant parser, then yylex
will expect to be called with another additional parameter (or two, if you use locations), and the reentrant bison parser will supply the additional parameters. That all works fine, but it has the effect of changing yylval
(and yylloc
, if used) to a pointer, which means that you need to go through all the scanner actions changing yylval.something
to yylval->something
.
Notes
You can also create a reentrant parser, using some additional bison options. Normally, the only mutable globals used by a bison-generated parser are yylval
and yylloc
(if you use location reporting). (And yynerrs
, but it is rare to refer to that variable outside of a parser action.) Specifying a reentrant parser turns those globals into lexer arguments, but it does not create an externally visible parser state structure. But it also gives you the option of using a "push parser", which does have a persistent parser state structure. In some cases, the flexibility of push parsers can significantly simplify scanners.
Strictly speaking, nothing stops you from creating a reentrant scanner which still uses globals to communicate with the parser, except that it is not really reentrant any more. I wouldn't recommend this option for obvious reasons, but you might want to do it as a transitional strategy, since it requires less modification to the parser and to scanner actions.