How to resolve ambiguity in the definition of an LR(1) grammar?

Question

I am writing a Golang compiler in OCaml, and argument lists are causing me a bit of a headache. In Go, you can group consecutive parameter names of the same type in the following way:

func f(a, b, c int)  ===  func f(a int, b int, c int)

You can also have a list of types, without parameter names:

func g(int, string, int)

The two styles cannot be mix-and-matched; either all parameters are named or none are.

My issue is that when the parser sees a comma, it doesn't know what to do. In the first example, is a the name of a type or the name of a variable with more variables coming up? The comma has a dual role and I am not sure how to fix this.

I am using the Menhir parser generator tool for OCaml.

Edit: at the moment, my Menhir grammar follows exactly the rules as specified at http://golang.org/ref/spec#Function_types

@tomwilde: I've edited the post with the proper information. The code just follows the same structure as the specification. — gnuvince
I think the simplest approach is to treat any whitespace (without a comma) as a pre-type token. The token after the space (and there can only be one space without a comma) must be the type. If the token before the space doesn't match any type in the current scope, or any built in type, it must be a named variable. This is probably overly naive. — Intermernet
If the token before the space doesn't match any type that would imply you have access to the current scope and the type names defined within it. That's a bad idea. C does it. See my answer. — thwd

rici rici · Accepted Answer · 2014-09-23T04:57:08

As written, the go grammar is not LALR(1). In fact, it is not LR(k) for any k. It is, however, unambiguous, so you could successfully parse it with a GLR parser, if you can find one (I'm pretty sure that there are several GLR parser generators for OCAML, but I don't know enough about any of them to recommend one).

If you don't want to (or can't) use a GLR parser, you can do it the same way Russ Cox did in the gccgo compiler, which uses bison. (bison can generate GLR parsers, but Cox doesn't use that feature.) His technique does not rely on the scanner distinguishing between type-names and non-type-names.

Rather, it just accepts parameter lists whose elements are either name_or_type or name name_or_type (actually, there are more possibilities than that, because of the ... syntax, but it doesn't change the general principle.) That's simple, unambiguous and LALR(1), but it is overly-accepting -- it will accept func foo(a, b int, c), for example -- and it does not produce the correct abstract syntax tree because it doesn't attach the type to the list of parameters being declared.

What that means is that once the argument list is fully parsed and is about to be inserted into the AST attached to some function declaration (for example), a semantic scan is performed to fix it up and, if necessary, produce an error message. That scan is done right-to-left over the list of declaration elements, so that the specified type can be propagated to the left.

It's worth noting that the grammar in the reference manual is also overly-accepting, because it does not express the constraint that "either all parameters are named or none are". That constraint could be expressed in an LR(1) grammar -- I'll leave that as an exercise for readers -- but the resulting grammar would be a lot more difficult to understand.

How to resolve ambiguity in the definition of an LR(1) grammar?

2 Answers