Yacc grammar producing incorrect terminal

Question

I've been working on a hobby compiler for a while now, using lex and yacc for the parsing stage. This is all working fine for the majority of things, but when I added in if statements, the production rule for symbols is now giving the previous (or next?) item on the stack instead of the symbol value needed.

Grammar is given below with hopefully unrelated rules taken out:

%{
       ...
%}


    %define parse.error verbose


    %token ...

    %%


    Program:
            Function                                            { root->addChild($1);}      
            ;


    Function:
            Type Identifier '|' ArgumentList '|' StatementList END
                                                                { $$ = new FunctionDef($1, $2, $4, $6); }


    /******************************************/
    /* Statements and control flow ************/
    /******************************************/

    Statement:
            Expression Delimiter
            | VariableDeclaration Delimiter
            | ControlFlowStatement Delimiter
            | Delimiter
            ;

    ControlFlowStatement:
            IfStatement
            ;

    IfStatement:
            IF Expression StatementList END                       { $$ = new IfStatement($2, $3); }
            | IF Expression StatementList ELSE StatementList END  { $$ = new IfStatement($2, $3, $5);}
            ;

    VariableDeclaration:
            Type Identifier                                     { $$ = new VariableDeclaration($1, $2);}
            | Type Identifier EQUALS Expression                 { $$ = new VariableDeclaration($1, $2, $4);}
            ;

    StatementList:
            StatementList Statement                             { $1->addChild($2);             }
            | Statement                                         { $$ = new GenericList($1);     }
            ;


    Delimiter:
            ';'
            | NEWLINE
            ;
    Type:
           ...
Expression:
    ...

    PostfixExpression:
            Value '[' Expression ']'                            { std::cout << "TODO: indexing operators ([ ])" << std::endl;}
            | Value '.' SYMBOL                                  { std::cout << "TODO: member access" << std::endl;}
            | Value INCREMENT                                   { $$ = new UnaryExpression(UNARY_POSTINC, $1);  }
            | Value DECREMENT                                   { $$ = new UnaryExpression(UNARY_POSTDEC, $1);  }
            | Value '(' ')'                                     { $$ = new FunctionCall($1, NULL);    }
            | Value '(' ExpressionList ')'                      { $$ = new FunctionCall($1, $3);                }
            | Value
            ;


    Value:
            BININT                                              { $$ = new Integer(yytext, 2);                  }
            | HEXINT                                            { $$ = new Integer(yytext, 16);                 }
            | DECINT                                            { $$ = new Integer(yytext);                     }
            | FLOAT                                             { $$ = new Float(yytext);                       }
            | SYMBOL                                            { $$ = new Symbol(yytext);                      }
            | STRING                                            { $$ = new String(yytext);                      }
            | LambdaFunction
            | '(' Expression ')'                                { $$ = $2;                                      }
            | '[' ExpressionList ']'                            { $$ = $2;}
            ;

    LambdaFunction:
            ...


    %%

I cannot work out what about the control flow code can make the Symbol: rule match something that isn't classed as a symbol from the lex definition:

symbol                      [a-zA-Z_]+(alpha|digit)*
...
{symbol}                    {return SYMBOL;}

Any help from somebody who knows about yacc and grammars in general would be very much appreciated. Also example files of the syntax it parses can be shown if necessary.

Thanks!

TL;DR! Narrow down the grammar to the minimum needed to support the problematic rule. For example, you probably don't need the full expression hierarchy, or most other statements, or declarations. If it works then, then add one more level for the expressions, and continue until you get the error. And if you get the error even with the minimal grammar, at least you eliminated many possible sources, and can edit your question to include only the minimal grammar. — Some programmer dude
I was thinking the same thing myself, that it might be too long to post but I decided to err on the side of caution with not knowing where the error is! You're right though I could do without the expression stack et al. — Cameron Diver
In regard to stripping back the grammar to find the source of the problem, I did do that, and found it to be the control flow code, but I just cannot figure out how or why. — Cameron Diver
Look at the IF rule again, are you sure you want a statement as the condition? Right now it allows for infinite recursion (it allows if if if if if if ... and so on). — Some programmer dude
That's actually a left-in test I forgot to take out, I've been stumbling around in the dark for a while! Its supposed to be expression but that still doesnt help :/ thanks for the reply. — Cameron Diver

rici rici · Accepted Answer · 2015-07-05T23:32:28

You cannot count on the value of yytext outside of a flex action.

Bison grammars typically read a lookahead token before deciding on how to proceed, so in a bison action, yytext has already been replaced with the token value of the lookahead token. (You can't count on that either, though: sometimes no lookahead token is needed.)

So you need to make a copy of yytext before the flex action returns and make that copy available to the bison grammar by placing it into the yylval semantic union.

See this bison FAQ entry

By the way, the following snippet from your flex file is incorrect:

symbol                      [a-zA-Z_]+(alpha|digit)*

In that regular expression, alpha and digit are just ordinary strings, so it is the same as [a-zA-Z_]+("alpha"|"digit")*, which means that it will match, for example, a_digitdigitdigit but not a_123. (It would have matched a_digitdigitdigit without the part following the +, so I presume that wasn't your intention.)

On the whole, I think it's better to use Posix character classes than either hand-written character classes or defined symbols, so I would write that as

symbol    [[:alpha:]_]([[:alnum:]_]*[[:alnum:]])?

assuming that your intention is that a symbol can start but not end with an underscore, and end but not start with a digit. Using Posix character classes requires you to execute flex with the correct locale -- almost certainly the C locale -- but so do character ranges, so there is nothing to be lost by using the self-documenting Posix classes.

(Of course, I have no idea what your definitions of {alpha} and {digit} are, but it seems to me that they are either the same as [[:alpha:]] and [[:digit:]], in which case they are redundant, or different from the Posix classes, in which case they are confusing to the reader.)

Yacc grammar producing incorrect terminal

1 Answers