2
votes

I am trying to describe a grammar in bison but I am unsure if it can be done. My intended grammar is this:

%token A B C D SEP

%%

items          : /* empty */
               | items_nonempty
               ;

items_nonempty : item
               | items_nonempty SEP item
               ;

item           :       B
               |       B       SEP D
               |       B SEP C
               |       B SEP C SEP D
               | A SEP B
               | A SEP B       SEP D
               | A SEP B SEP C
               | A SEP B SEP C SEP D
               ;

"items" is a (possible empty) sequence of item elements, separated by a SEP token.

Each item consists of up to 4 tokens (A B C D), in that order, separated by SEP. The A, C, and D tokens in an item are optional.

Note the re-use of the same separator token SEP within each item, and between the items themselves.

I hope the intended grammer is clear. I think it is unambiguous, but I am quite unsure if it is sufficiently restricted to be parseable by bison – unfortunately, my parser knowledge is quite rusty.

Using the grammar as given, bison reports 4 shift/reduce conflicts. Looking at the 'output' I understand where they occur and why; but I am at a loss how (and if) the intended grammar can be written to get rid of the S/R conflicts.

I am unwilling to use an %expect declaration. Likewise, I am unwilling to have my scanner consume the separator tokens rather than have them pass on to the parser.

Any hints on how to sanitise this grammar would be greatly appreciated.

4

4 Answers

1
votes

The basic problem is that the grammar as written needs TWO tokens of lookahead to decide when it has found the end of an item and can thus reduce it or if there's another piece of the current item after the SEP it sees as the next character of the lookahead.

There are a number of approaches you can try

  • use btyacc or bison's GLR support to effectively get more lookahead.

  • write the grammar to accept an arbitrary list of single items and then use a post-pass to regroup them into sets of 1-4 items with at least 1 B and reject malformed sets (This is Gunther's suggestion)

  • use the scanner to do more lookahead -- instead of returning simple SEP tokens, return SEP_BEFORE_A_OR_B or SEP_NOT_BEFORE_A_OR_B depending on what the next token after the SEP is.

  • combine tokens in the scanner -- return SEP_C and SEP_D as single tokens (a separator followed by a C or D)

0
votes

The grammar indeed is unambiguous, it is LL(7) and (without having verified) I believe it is LR(2), presumably even LALR(2). So if you had a generator for either of those, it would do the job.

The lookahead-1 conflicts arise from using the same separator both between items, and within items, and they will vanish if you either give up the separator, or the item structure.

So what you could do is parse twice, with different grammars. In the first pass, you could verify separators being positioned properly, and the grammar could be something like

items          :
               | items_nonempty
               ;
items_nonempty : item
               | items_nonempty SEP item
               ;
item           : A
               | B
               | C
               | D
               ;

In the second (and more important) pass, you would verify the item structure. This could be

items          :
               | items_nonempty
               ;
items_nonempty : item
               | items_nonempty item
               ;
item           : B
               | B D
               | B C
               | B C D
               | A B
               | A B D
               | A B C
               | A B C D 
               ;

where you have the separators ignored by the lexer.

Both of the above are LALR(1), the former is LL(1) and the latter is LL(4), but it could be made LL(1) by some factoring.

This would be a solution that is independent of special offerings as available by bison, or the lexer tool. I am looking forward to learning about what can be done from their part.

0
votes

This has one shift/reduce conflict:

%token A B C D SEP

%%

items
    : /* empty */
    | items_nonempty
    ;

items_nonempty
    : item
    | items_nonempty SEP item
    ;

item
    : opt_a B opt_c_d_list
    ;

opt_a
    :   /* Nothing */
    |   A SEP
    ;

opt_c_d_list
    :   /* Nothing */
    |   opt_c_d_list c_or_d
    ;

c_or_d
    :   SEP C
    |   SEP D
    ;

The opt_a rule alone changes the S/R count from 4 to 2. The residual problem is that the same SEP sepaerates a C or a D after the B, and Yacc can't look ahead. You would need a semantic check to outlaw 'B SEP D SEP C'; the rules above would allow that.

Can you consider modifying your tokenizer to return C on reading SEP C?, and D on reading SEP D? You might even use lexical feedback and a start condition in flex, so that on reading the B, you flip a switch so that SEP C is returned as just C, and SEP D is returned as just D. If this was possible, the following unambiguous grammar would work with no S/R conflicts:

%token A B C D SEP

%%

items
    : /* empty */
    | items_nonempty
    ;

items_nonempty
    : item
    | items_nonempty SEP item
    ;

item
    : opt_a B opt_c opt_d
    ;

opt_a
    :   /* Nothing */
    |   A SEP
    ;

opt_c
    :   /* Nothing */
    |   C
    ;

opt_d
    :   /* Nothing */
    |   D
    ;
0
votes

You can include the SEP following your other tokens into a single rule. Written very concisely, your grammer could be expressed like this:

%token A B C D SEP
%%
items : /* empty */ | item | itemsSEP item ;
item : a B | a b C | a b c D ;
itemsSEP : itemSEP | itemsSEP itemSEP ;
itemSEP : a b c d ;
a : /* empty */ | A SEP ;
b : B SEP ;
c : /* empty */ | C SEP ;
d : /* empty */ | D SEP ;

So now I have itemSEP for an item followed by a separator, but item for the last one which is not followed by a separator. Those are made up from the lower-case single-letter non-terminals which each include the following separator as well, and take care of some arguments being optional. Only the last argument to item is always a raw terminal, as there won't be a separator following that one.

With the grammar expressed in this way, you won't run into any shift-reduce conflicts, as the grammar now is LALR(1). At every step, it will know exactly what reduction to apply, even if the main point of that rule is getting rid of one SEP so we can look one token further ahead.