3
votes

I've recently wrapped my head around LALR enough to write an LALR generator, and I'm trying to construct a java- or c#-style grammar for it (the beginnings of which are specified here).

I know it's extra effort to write the parser generator, like re-inventing the wheel (why not use Antlr?), but my goal is to bootstrap a hobby OS that can compile itself without depending on a third-party toolchain. My problem though is not with the generator, but with the grammar.

I'm running into reduce/reduce ambiguities with statements and expressions.

I know how to resolve certain types of ambiguities, such as dangling-else, but these few have are not intuitive to me, and they have me stumped.

What's the best way to resolve these? Also, is there a prototyping tool that I can use to help visualize the solution? Or, should I go back to square one and try implementing a GLR parser generator for the grammar?

These statements are legal:

Generic.List<int> myVar1 = x + 4, myVar2; // stmt -> var-decl ;
                                          // var-decl -> type-name var-decl-list

t = 99;                           // simple-stmt -> assign

i++;                              // simple-stmt -> incr-decr
                                  // incr-decr -> primary-expr ++

json.deserialize<list<int>>(obj); // simple-stmt -> call
                                  // call -> primary-expr ( params )
                                  // ...  -> primary-expr . basic-name ( params )
                                  // ...  -> basic-name . basic-name ( params )

Here's how it's set up:

basic-name : ident < type-list >
           | ident

nested-name : nested-name . basic-name
            | basic-name

basic-type : int | bool | ...

type-name : nested-name
          | basic-type

stmt : var-decl ;
     | simple-stmt ;
     | ...

var-decl : type-name var-decl-list

var-decl-list : var-decl-list , var-init
              | var-init

var-init : ident assign-op expression
         | ident

simple-stmt : assign
            | call
            | incr-decr

expr : assign-expr

assign-expr : assign
            | cond-expr

assign : unary-expr assign-op expr

...
rel-expr : rel-expr < shift-expr
         ...
         | shift-expr

...
unary-expr : unary-op primary-expr
           | primary-expr

unary-op : + - ! ~ ++ --  // Prefix operators
         | ( type-name )  // Conversion operator

primary-expr : call
             | primary-expr . basic-name
             | primary-expr ++
             | ( expr )
             ...
             | basic-name

call : primary-expr ( params )

incr-decr : primary-expr ++
          | -- primary-expr
          | ...

So when the parser is expecting a statement, the *LR(k) item set kernel is method-body -> { * stmts-opt } and the full item set for the state looks something like this:

method-body -> { * stmts-opt }
stmts-opt -> * stmts
stmts-opt -> *
stmts -> * stmts stmt
stmt -> * var-decl ;
stmt -> * simple-stmt ;
var-decl -> * type-name var-decl-list
simple-stmt -> * assign
simple-stmt -> * call
simple-stmt -> * incr-decr
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
incr-decr -> * primary-expr ++
...

When an identifier is shifted, the next state is:

basic-name -> ident *
basic-name -> ident * < type-list >

which is parsed out or reduced, and brings the next state to:

nested-name -> basic-name *
primary-expr -> basic-name *

Potential conflict. In the parent state, lookaheads don't help, since there is a dot in nested-name and primary-expr. Oh, goody, let's try reducing by nested-name:

type-name -> nested-name *
nested-name -> nested-name * . basic-name

Nothing to see here... Now, how about reducing by primary-expr instead:

unary-expr -> primary-expr *
primary-expr -> primary-expr * . basic-name
primary-expr -> primary-expr * ++
call -> primary-expr * ( params )
incr-decr -> primary-expr * ++
...

Now when we shift ++, we get:

primary-expr -> primary-expr ++ *
incr-decr -> primary-expr ++ *

...another reduce-reduce conflict.

Let's try shifting the ( instead of the ident:

primary-expr -> ( * expr )
unary-op -> ( * type-name )
expr -> * assign-expr
assign-expr -> * assign
assign-expr -> * cond-expr
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
cond-expr -> * ...
...
rel-expr -> * rel-expr < shift-expr
rel-expr -> * shift-expr
...
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident

Same issues when you shift an ident or ( onto the stack.

These are just the ones I've run into so far. Since basic-name has precedence over rel-expr, I'm assuming something like x < n would be interpreted as basic-name -> ident < type-list *, and error out if it was actually a relational expression.

My brain has reached the point where I could really use some help.

1
Does your LALR-generator implement bison-like precedence declarations? If not, how are we to interpret "Since basic-name has precedence over rel-expr..."?rici
@rici Since rel-expr derives basic-name in so-many steps, I'm saying that it has a higher precedence.Josh Wyant
That doesn't make any sense to me. The fact that A derives B makes no difference in the LR algorithm's decision to shift or reduce.rici
@rici Maybe I still don't completely understand. Since there's only once symbol of lookahead, in order to shift the < token for a rel-expr we have to reduce the ident to basic-name then on down to rel-expr to get to the state rel-expr -> rel-expr * < shift-expr. If we do that though, even if the full input is x < n > ( ), we've already reduced to rel-expr: rel-expr * < n > (), and we can't get < type-list > from the next 3 tokens because in order to do that, we have to get basic-name -> ident < type-list > * on the stack, not basic-name -> rel-expr < type-list > *.Josh Wyant
That is certainly a shift reduce conflict. It is quite possibly even an ambiguity.rici

1 Answers

2
votes

There are a few questions in your post, which makes it not really ideal for SO. But I'll try to provide some thoughts about each one. As I see it, you have three issues:

  1. Distinguishing expression statements from expressions which are not statements.

  2. Parsing hierarchically-named types in a declaration without conflicting with field-access expressions in expression statements

  3. Distinguishing between the use of < as a comparison operator and as a template bracket.


1. Distinguishing expression statements from expressions which are not statements.

As I understand, the desire is to only allow as statements expressions which have (or potentially have) some kind of side-effect: assignments, increment mutations, and subroutine calls. Roughly speaking, this corresponds to Java, whose grammar includes:

StatementExpression:
  Assignment
  PreIncrementExpression
  PreDecrementExpression
  PostIncrementExpression
  PostDecrementExpression
  MethodInvocation
  ClassInstanceCreationExpression

Each of the alternatives listed for StatementExpression also appears somewhere in the derivation tree for Expression, where they have been factored out of the lists of possibilities. Here's a very condensed sample:

Expression:
  LambdaExpression
  AssignmentExpression

AssignmentExpression:
  ConditionalExpression
  Assignment

Assignment:
  LeftHandSide AssignmentOperator Expression

...

UnaryExpression:
  PreIncrementExpression
  + UnaryExpression
  UnaryExpressionNotPlusMinus

PreIncrementExpression:
  ++ UnaryExpression

UnaryExpressionNotPlusMinus:
  PostfixExpression
  ~ UnaryExpression

PostfixExpression:
  Primary
  ExpressionName
  PostIncrementExpression

PostIncrementExpress:
  PostfixExpression ++

Note how the non-terminals used on the right-hand side of ExpressionStatement are special-cased at each precedence level. In the C++ grammar, which doesn't limit which expressions can be statements, there is no need for a separate Assignment non-terminal:

assignment-expression:
  conditional-expression
  logical-or-expression assignment-operator initializer-clause

But in Java, that won't work. It needs to create the additional non-terminal which does not derive ConditionalExpression, precisely in order to allow Assignment to be a Statement and AssignmentExpression to be an Expression.

2. Parsing hierarchically-named types in a declaration without conflicting with field-access expressions in expression statements

Similar to the above, here it is necessary to put hierarchical names (which must start with a basic-name) from other forms of field access expressions, which might start with any (other) primary-expr. The former can be type names or primary expressions; the latter can only be type names. To make this distinction, we need to split the primary-expr production:

primary-expr : field-access-expr
             | nested-name

non-field-access-expr:
               call
             | post-increment-expression  // was primary-expr ++
             | ( expr )
             ...

field-access-expr :
               non-field-access-expr
             | field-access-expr . basic-name

3. Distinguishing between the use of < as a comparison operator and as a template bracket.

Unlike the other two issues, this one might actually be an ambiguity in the language. In C++, for example, template brackets are definitely ambiguous; they can only be resolved by knowing (or being told) whether a particular name is a template name or not. In Java, on the other hand, the ambiguity is solved by sometimes requiring the type parameters to come before the generic name. For example:

ConstructorDeclarator:
  [TypeParameters] SimpleTypeName ( [FormalParameterList] )

or

MethodInvocation:
  Primary . [TypeArguments] Identifier ( [ArgumentList] )

In C#, there is yet a different rule, which requires looking at the character following the > which potentially matches the opening <. The algorithm is described in §7.6.4.2 of the C# manual; I have no idea how you would implement that in an LALR(1) parser.