0
votes

So in my language I want to have dot-syntax expressions:

myObject.myProperty
myObject.myProperty.subProperty

And I want declarations

Object myObject = 1

Additionally, object types can be namespaced:

Object.SubObject mySubObject = 1

A simplified grammar is as follows:

program:
    declaration;
    | expression;

declaration: 
    TOKEN_IDENTIFIER TOKEN_IDENTIFIER '=' TOKEN_INTEGER;
    | TOKEN_IDENTIFIER '.' TOKEN_IDENTIFIER TOKEN_IDENTIFIER '=' TOKEN_INTEGER;

expression:
    TOKEN_IDENTIFIER;
    | expression '.' TOKEN_IDENTIFIER;

Unfortunately, compiling this grammar with Bison gives a shift-reduce conflict. Looking at the state machine output, it seems to me there's an error in the way Bison interprets it. The following is state 1, which is the state after reading the first identifier:

State 1

    3 declaration: "identifier" . "identifier" '=' "integer"
    4            | "identifier" . '.' "identifier" "identifier" '=' "integer"
    5 expression: "identifier" .

    "identifier"  shift, and go to state 5
    '.'           shift, and go to state 6

    "end of code"  reduce using rule 5 (expression)
    '.'            [reduce using rule 5 (expression)]

And state 6 (the default shift state when reading a dot) is only for a declaration:

State 6

    4 declaration: "identifier" '.' . "identifier" "identifier" '=' "integer"

    "identifier"  shift, and go to state 10

It seems to me that, in state 1, there should not be a possibility to reduce upon reading a dot. It should look ahead, and if it sees two identifiers right after each other (no dot in between), it should then shift to a declaration-only state, but if it sees a second dot or end of code, it then reduces to expression. The fact that the rule for a declaration is the only instance where two identifiers can be found side-by-side without a dot between should disambiguate the grammar so there are no shift-reduce errors.

I tried this with ielr and canonical-lr with the same results (don't know if that should matter).

Any ideas? Is my interpretation of how it should work incorrect?

2
What that is telling you is that, if we have seen TOKEN_IDENTIFIER '.' TOKEN_IDENTIFIER, it doesn't know, outside of you providing other hints to it, whether it should reduce, returning an "expression", or shift in another token, because it's seen the first part of a "declaration". Some additional syntax, like requiring both an "expression" and a "declaration" to be terminated by a semicolon or something, would help reduce that ambiguity... With your current rules, x.y a could be either an expression followed by the beginning of another (x.y a.b) or the start of a declaration...twalberg
I don't provide recursion into multiple statements in this grammar, so the grammar can't have multiple statements. And even when I do add a ';' delimiter, I still get the shift/reduce conflict.Anthony Johnson
Looking around, I believe I'm having the same issue as stackoverflow.com/questions/19335353/…, in that, yes, it's an unambiguous grammar, but maybe it's an LR(2), which bison doesn't handle. I thought the ielr option was supposed to fix that. Any ideas on how to adjust the grammar so it's LR(1).Anthony Johnson

2 Answers

3
votes

It should look ahead, and if it sees two identifiers right after each other …

The 1 in LALR(1) means "one token of lookahead". So the parse will not see two identifiers right after each other in the lookahead, because it can only see one identifier.

I thought the ielr option was supposed to fix that.

Nope. The ielr parser is LR(1) (almost). And canonical-lr is LR(1) (precisely). In all cases, it's the same 1.

The easy solution is to use a GLR parser (%glr-parser) which is not restricted to one token of lookahead, because it doesn't use lookahead at all. Instead, it maintains all the possible parse stacks in parallel until it either finds a way to figure out which one is valid, or it knows that it is impossible to resolve the ambiguity. Obviously there is a performance penalty for keeping the multiple stacks, but it's probably not going to be the compiler's bottleneck. The GLR parser will work with any unambiguous grammar without modifications, and in most cases it is not necessary to modify the parser actions either so it's often a good choice.

Otherwise, you could add a "seemingly redundant" production (the phrase comes from a popular compilers text):

expression: TOKEN_IDENTIFIER
          | compound_expression
compound_expression
          : TOKEN_IDENTIFIER '.' TOKEN_IDENTIFIER
          | compound_expression '.' TOKEN_IDENTIFIER

That solution doesn't always scale well to more complicated grammars, though.

1
votes

The normal way of handling this sort of thing in an LALR(1) parser is to factor out the common thing between expressions and declarations that is causing the conflict. In your case, this is an optionally scoped name that might either be the type name in a declaration, or a field reference in an expression. So you refactor that as

name: TOKEN_IDENTIFIER
    | name '.' TOKEN_IDENTIFIER

declaration: name TOKEN_IDENTIFIER '=' expression

expression: name
          | ... other expression rules

The reason this works is that it can now just recognize a name at the beginning and not care whether it is the beginning of a declaration or an expression -- that decision is deferred until it sees the next token after the full name.

Note that this will still fail if you have a rule that allows two consecutive expressions with no token between them.

Also, this grammar is slightly different from your original grammar in that it allows multiple levels of scoping in a declarations such as

Object.SubObject.SubSubObject mySubSubObject = 1

while your grammar only allows 0 or one scope level.