0
votes

I am trying to write a parser that recognizes a particular syntax. I am pretty far and most of the stuff works. I came across a problem with the Bison parser though when trying to enable it to recognize two similar but distinct constructs. Here are the two constructs I want it to recognize:

type_1 IS ENUMERATION OF 
BEGIN
    element_1,
    element_2 := 4
END type_1;

type_2 IS ENUMERATION SIZE 8 OF 
BEGIN
    element_1,
    element_2 := 4
END type_2;

Basically this is for a enum like declaration that can be given with or without a bitsize parameter for the elements.

The Bison file I wrote contains two alternatives for the two versions. Each of them separate, when either being alone, or when it comes before the other one, works well. However when I put both description into the Y file, it only recognizes one of my declarations or the other.

Here are the lines of Bison code:

    enum_type : sized_enum_type                 { $$=$1; P3; }
          | nonsized_enum_type              { $$=$1; P3; }
          ;

sized_enum_type : identifier[L]             {P1(37,$L->Get_Line(),$L->Get_Column());} 
                IS[I]                       {P2(38,$I.ln,$I.clmn);} 
                ENUMERATION[N]              {P2(43,$N.ln,$N.clmn);} 
                SIZE[Z]                     {P2(42,$Z.ln,$Z.clmn);} 
                INTEGER_VALUE[V]            {P2(39,$V.ln,$V.clmn);} 
                OF[O]                       {P2(40,$O.ln,$O.clmn);} 
                TBEGIN[B]                   {P2(16,$B.ln,$B.clmn);} 
                enum_element_list[S]        {P2(41,$S->Get_Line(),$S->Get_Column());} 
                TEND[E]                     {P2(16,$E.ln,$E.clmn);} 
                identifier[R]               {P2(15,$R->Get_Line(),$R->Get_Column());}
                SEMICOLON                   {$$= new AST::AST_Type_Enum( (AST::AST_Code_Identifier *) $L, $V.value ); $$->_Set_Line_And_Column(  $I.ln, $I.clmn); P3;}
                ;

nonsized_enum_type : identifier[L]              {P1(37,$L->Get_Line(),$L->Get_Column());} 
                     IS[I]                       {P2(38,$I.ln,$I.clmn);} 
                     ENUMERATION[N]              {P2(39,$N.ln,$N.clmn);} 
                     OF[O]                       {P2(40,$O.ln,$O.clmn);} 
                     TBEGIN[B]                   {P2(16,$B.ln,$B.clmn);} 
                     enum_element_list[S]        {P2(41,$S->Get_Line(),$S->Get_Column());} 
                     TEND[E]                        {P2(16,$E.ln,$E.clmn);} 
                     identifier[R]              {P2(15,$R->Get_Line(),$R->Get_Column());}

                     SEMICOLON                  {$$= new AST::AST_Type_Enum( (AST::AST_Code_Identifier *) $L, -1 ); $$->_Set_Line_And_Column(  $I.ln, $I.clmn); P3;}
                     ;
enum_element_list   : one_enum                                                          { $$=$1; }
                   | one_enum COMMA enum_element_list                                   { ((AST::AST_Enum_Value *)$1)->Append( (AST::AST_Enum_Value *) $3); $$=$1; }
                   ;

one_enum    : identifier[L]                 {$$= new AST::AST_Enum_Value( (AST::AST_Code_Identifier *) $L, -1 ); $$->_Set_Line_And_Column( $L->Get_Line(),$L->Get_Column()); P3;}
           | identifier[L]                  {P1(17,$L->Get_Line(),$L->Get_Column());} 
             ASSIGNMENT_SHALLOW_COPY[A]    {P2(42,$A.ln, $A.clmn);} 
             INTEGER_VALUE[I]               {$$= new AST::AST_Enum_Value( (AST::AST_Code_Identifier *) $L, $I.value ); $$->_Set_Line_And_Column( $L->Get_Line(),$L->Get_Column()); P3;}
             ;

I dont think the C-statements on the right side of each rule part are necessary but I include them here nevertheless.

When I run the compiled parser on my sourcecode it complaints that it was expecting the token SIZE. When I exchange the rules for nonsized_enum_type and sized_enum_type (not the right side in the rule for enum_type, the complete source code order for the rules, move one rule below the other) then it recognizes the other form without size. When I include the SIZE 8 part in my code, then it complains that it was expecting the other form (OF expected).

So my question is: How is it possible to write a parser rule that recongizes both parts?

Shouldn't the generated code recognize that one rule is leading nowhere and then try the other one? It appears the second rule is never tried.

Thanks everyone

1

1 Answers

1
votes

Inserting all those mid-rule actions (MRAs) into your grammar cripples the parser. Since you don't include the definitions of P1 and P2, it's hard to know why you need the MRAs, but it seems unlikely that it is actually necessary to, for example, execute a specific reduction action between the tokens IS and ENUMERATION, rather than executing the reduction action a bit later.

Specifically, the problem can be reduced to the following. Consider the two productions (only up to their first MRA):

sized_enum_type : identifier[L]     {P1(37,$L->Get_Line(),$L->Get_Column());} 
nonsized_enum_type : identifier[L]  {P1(37,$L->Get_Line(),$L->Get_Column());}

Each of those MRAs is implemented with a tag production. The logic of left-to-right parsing is that all reductions must be performed before shifting the next token. In both the above cases, the next token is IS, but seeing the IS does not clarify which reduction should be performed. As a result, bison will report a reduce/reduce conflict, and then choose the production which comes first in the definition file. The other reduction will effectively be eliminated, which makes the rest of its production unreachable.

The two reduction actions are, as it happens, identical, but bison doesn't know that. From bison's perspective, all MRAs are unique. In effect, the use of a MRA requires the grammar -- at least at that particular point -- to be LL(1), removing all advantage of the LR(1) parsing algorithm.

You could avoid this problem by left-factoring, but I'd strongly suggest that you try to rely less on MRAs. For building an AST, MRAs are practically never necessary, and they have a non-trivial cost as well as reducing the ability of the parser to cope with non-LL(1) grammars.