7
votes

I have flex code that copies a string lexeme using strdup().

%{   
#include "json.tab.h"
#define YY_DECL extern "C" int yylex()

%}
%option noyywrap

%%

[ \t\n]+ ; 
\"[a-zA-Z]+\" {yylval.sval = strdup(yytext); return STRING; }
[0-9]+ {yylval.ival = atoi(yytext); return NUMBER; }
. {return yytext[0];} ; 

%%

strdup() allocates memory and copies the input string into it and return (strdup() - what does it do in C?), so I guess I need to free it up when I don't need it anymore.

From this post:When is %destructor invoked in BISON?, I added %destructor { free($$); printf("free");} STRING in the yacc file.

However, I don't see the free() is invoked even when yylval.sval is assigned with new string returned from strdup().

What might be wrong? How to free the allocated string in flex/bison?

Added

I think about using statically allocated sval as follows:

%union {
    int ival;
    char sval[100]; // char* sval;
}

The flex code now becomes (without the checking code if yytext is less than 100 bytes):

\"[a-zA-Z]+\" {
    //yylval.sval = strdup(yytext);
    memset(yylval.sval, 0, 100);
    strcpy(yylval.sval, yytext);
    return STRING; 
}

I'm not sure if this approach is what people normally use.

Added2

For my application, simple interning is OK.

extern char buffer[]; // [100];
%}
%option noyywrap

%%

\"[a-zA-Z]+\" {
        //yylval.sval = strdup(yytext);
        memset(buffer, 0, 100);
        strcpy(buffer, yytext);
        yylval.sval = buffer;
        return STRING; 
    }
...

char buffer[100];

For yacc code

%union {
    int ival;
    char *sval; 
}
1
You know that c and c++ are not the same? Don't use strdup() in c++, use std::string and forget about alocating or freeing memory for a string.Iharob Al Asimi
@iharob: I guess I don't understand your answer. Could you elaborate how to use std::string to assign it to yylval variable?prosseek
I don't remember how, but it's certainly possible. The problem is that you are generating a c parser, and using c++ code. Read about c++ parsers.Iharob Al Asimi
@iharob: The best you could do is put a std::string * in the union, and that requires you to still be concerned with memory management.Mark Beckwith

1 Answers

9
votes

As you say, you need to free the string "when I don't need it anymore." It's as simple (or complicated) as that.

C does not have a garbage collector, and C programmers are therefore responsible for knowing when allocated memory is no longer needed. The language does not attempt to figure it out, and (mostly) neither does bison.

If you have a reduction rule which is provided with one or more semantic values containing pointers to allocated memory, that rule might do any of a number of things. It might pass the semantic values into the new semantic value, typically by copying only the pointer. It might copy the semantic value, and then free the original. It might add the semantic value to a parse-global datastructure, like a symbol table.

In all of those cases, the programmer should be aware of whether or not the allocated memory is still required, and should call free the allocation if it is not.

However, there are a few cases in which bison will discard a semantic value without it ever being presented to a reduction action. Most of these are error conditions. If as part of error recovery, bison decides to discard a token, that token's semantic value could leak memory. And it is precisely for this case that bison has a %destructor declaration. The %destructor code is called if (and only if) bison discards the token as a result of error recovery or post-error clean-up. All other cases are your responsibility.

Trying to evade this responsibility by making stack slots enormous (such as including a char[100] in the semantic value union) is both unsafe and inefficient. It's unsafe because you need to be constantly aware that the fixed space buffer could overflow, meaning that parsing a syntactically valid program might overwrite arbitrary memory. It's inefficient because you end up making the stack several orders of magnitude larger than necessary; and also because you end up constantly copying the stack slots (at least twice for every reduction rule, even the ones which use the default action.)

Figuring out the lifetime of a semantic value is only complicated if you intend to share memory. That's not usually useful for string literals (as in your example) but it can be quite helpful for variable names; most names occur more than once in a program, so there is always the temptation to use the same character string for each occurrence.

I usually solve the identifier problem by "interning" the string in the lexer. The lexer maintains a parse-global name table -- say, a simple set implemented with a hash-table -- and for each identifier it encounters, it adds the identifier to the name table and passes the unique name entry pointer as the semantic value. At some point after the end of the parse, the entire name table can be freed, freeing all the identifiers.

For string literals and other probably-unique strings, you could either use the name table anyway, or you could avoid ever having two copies of a pointer to the same character string. Using the name table has the advantage of reducing the amount of work you need to do in memory management, but at the cost of possibly keeping unnecessary strings around for extra time. That depends a lot on the nature of the parse result: if it is an AST, then you probably need to keep the character strings as long as the AST exists, but if you are doing direct execution or one-pass code generation, you might not need the string literals in the long haul.