I've recently wrapped my head around LALR enough to write an LALR generator, and I'm trying to construct a java- or c#-style grammar for it (the beginnings of which are specified here).
I know it's extra effort to write the parser generator, like re-inventing the wheel (why not use Antlr?), but my goal is to bootstrap a hobby OS that can compile itself without depending on a third-party toolchain. My problem though is not with the generator, but with the grammar.
I'm running into reduce/reduce ambiguities with statements and expressions.
I know how to resolve certain types of ambiguities, such as dangling-else, but these few have are not intuitive to me, and they have me stumped.
What's the best way to resolve these? Also, is there a prototyping tool that I can use to help visualize the solution? Or, should I go back to square one and try implementing a GLR parser generator for the grammar?
These statements are legal:
Generic.List<int> myVar1 = x + 4, myVar2; // stmt -> var-decl ;
// var-decl -> type-name var-decl-list
t = 99; // simple-stmt -> assign
i++; // simple-stmt -> incr-decr
// incr-decr -> primary-expr ++
json.deserialize<list<int>>(obj); // simple-stmt -> call
// call -> primary-expr ( params )
// ... -> primary-expr . basic-name ( params )
// ... -> basic-name . basic-name ( params )
Here's how it's set up:
basic-name : ident < type-list >
| ident
nested-name : nested-name . basic-name
| basic-name
basic-type : int | bool | ...
type-name : nested-name
| basic-type
stmt : var-decl ;
| simple-stmt ;
| ...
var-decl : type-name var-decl-list
var-decl-list : var-decl-list , var-init
| var-init
var-init : ident assign-op expression
| ident
simple-stmt : assign
| call
| incr-decr
expr : assign-expr
assign-expr : assign
| cond-expr
assign : unary-expr assign-op expr
...
rel-expr : rel-expr < shift-expr
...
| shift-expr
...
unary-expr : unary-op primary-expr
| primary-expr
unary-op : + - ! ~ ++ -- // Prefix operators
| ( type-name ) // Conversion operator
primary-expr : call
| primary-expr . basic-name
| primary-expr ++
| ( expr )
...
| basic-name
call : primary-expr ( params )
incr-decr : primary-expr ++
| -- primary-expr
| ...
So when the parser is expecting a statement, the *LR(k) item set kernel is method-body -> { * stmts-opt }
and the full item set for the state looks something like this:
method-body -> { * stmts-opt }
stmts-opt -> * stmts
stmts-opt -> *
stmts -> * stmts stmt
stmt -> * var-decl ;
stmt -> * simple-stmt ;
var-decl -> * type-name var-decl-list
simple-stmt -> * assign
simple-stmt -> * call
simple-stmt -> * incr-decr
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
incr-decr -> * primary-expr ++
...
When an identifier is shifted, the next state is:
basic-name -> ident *
basic-name -> ident * < type-list >
which is parsed out or reduced, and brings the next state to:
nested-name -> basic-name *
primary-expr -> basic-name *
Potential conflict. In the parent state, lookaheads don't help, since there is a dot in nested-name
and primary-expr
. Oh, goody, let's try reducing by nested-name:
type-name -> nested-name *
nested-name -> nested-name * . basic-name
Nothing to see here...
Now, how about reducing by primary-expr
instead:
unary-expr -> primary-expr *
primary-expr -> primary-expr * . basic-name
primary-expr -> primary-expr * ++
call -> primary-expr * ( params )
incr-decr -> primary-expr * ++
...
Now when we shift ++, we get:
primary-expr -> primary-expr ++ *
incr-decr -> primary-expr ++ *
...another reduce-reduce conflict.
Let's try shifting the (
instead of the ident
:
primary-expr -> ( * expr )
unary-op -> ( * type-name )
expr -> * assign-expr
assign-expr -> * assign
assign-expr -> * cond-expr
assign -> * unary-expr assign-op expr
unary-expr -> * unary-op primary-expr
unary-expr -> * primary-expr
unary-op -> * ( typename )
unary-op -> * ! | ~ | ...
primary-expr -> * call
primary-expr -> * primary-expr . basic-name
primary-expr -> * primary-expr ++
primary-expr -> * basic-name
primary-expr -> * ( expr )
call -> * primary-expr ( params )
cond-expr -> * ...
...
rel-expr -> * rel-expr < shift-expr
rel-expr -> * shift-expr
...
type-name -> * nested-name
type-name -> * basic-type
nested-name -> * nested-name . basic-name
nested-name -> * basic-name
basic-name -> * ident < type-list >
basic-name -> * ident
Same issues when you shift an ident
or (
onto the stack.
These are just the ones I've run into so far. Since basic-name
has precedence over rel-expr
, I'm assuming something like x < n
would be interpreted as basic-name -> ident < type-list *
, and error out if it was actually a relational expression.
My brain has reached the point where I could really use some help.
rel-expr
derivesbasic-name
in so-many steps, I'm saying that it has a higher precedence. – Josh Wyant<
token for a rel-expr we have to reduce theident
tobasic-name
then on down torel-expr
to get to the staterel-expr -> rel-expr * < shift-expr
. If we do that though, even if the full input isx < n > ( )
, we've already reduced torel-expr
:rel-expr * < n > ()
, and we can't get< type-list >
from the next 3 tokens because in order to do that, we have to getbasic-name -> ident < type-list > *
on the stack, notbasic-name -> rel-expr < type-list > *
. – Josh Wyant