8
votes

I am writing a Golang compiler in OCaml, and argument lists are causing me a bit of a headache. In Go, you can group consecutive parameter names of the same type in the following way:

func f(a, b, c int)  ===  func f(a int, b int, c int)

You can also have a list of types, without parameter names:

func g(int, string, int)

The two styles cannot be mix-and-matched; either all parameters are named or none are.

My issue is that when the parser sees a comma, it doesn't know what to do. In the first example, is a the name of a type or the name of a variable with more variables coming up? The comma has a dual role and I am not sure how to fix this.

I am using the Menhir parser generator tool for OCaml.

Edit: at the moment, my Menhir grammar follows exactly the rules as specified at http://golang.org/ref/spec#Function_types

2
what does your grammar look like?thwd
@tomwilde: I've edited the post with the proper information. The code just follows the same structure as the specification.gnuvince
I think the simplest approach is to treat any whitespace (without a comma) as a pre-type token. The token after the space (and there can only be one space without a comma) must be the type. If the token before the space doesn't match any type in the current scope, or any built in type, it must be a named variable. This is probably overly naive.Intermernet
If the token before the space doesn't match any type that would imply you have access to the current scope and the type names defined within it. That's a bad idea. C does it. See my answer.thwd

2 Answers

5
votes

As written, the go grammar is not LALR(1). In fact, it is not LR(k) for any k. It is, however, unambiguous, so you could successfully parse it with a GLR parser, if you can find one (I'm pretty sure that there are several GLR parser generators for OCAML, but I don't know enough about any of them to recommend one).

If you don't want to (or can't) use a GLR parser, you can do it the same way Russ Cox did in the gccgo compiler, which uses bison. (bison can generate GLR parsers, but Cox doesn't use that feature.) His technique does not rely on the scanner distinguishing between type-names and non-type-names.

Rather, it just accepts parameter lists whose elements are either name_or_type or name name_or_type (actually, there are more possibilities than that, because of the ... syntax, but it doesn't change the general principle.) That's simple, unambiguous and LALR(1), but it is overly-accepting -- it will accept func foo(a, b int, c), for example -- and it does not produce the correct abstract syntax tree because it doesn't attach the type to the list of parameters being declared.

What that means is that once the argument list is fully parsed and is about to be inserted into the AST attached to some function declaration (for example), a semantic scan is performed to fix it up and, if necessary, produce an error message. That scan is done right-to-left over the list of declaration elements, so that the specified type can be propagated to the left.

It's worth noting that the grammar in the reference manual is also overly-accepting, because it does not express the constraint that "either all parameters are named or none are". That constraint could be expressed in an LR(1) grammar -- I'll leave that as an exercise for readers -- but the resulting grammar would be a lot more difficult to understand.

3
votes

You don't have ambiguity. The fact that the standard Go parser is LALR(1) proves that.

is a the name of a type or the name of a variable with more variables coming up?

So basically your grammar and the parser as a whole should be completely disconnected from the symbol table; don't be C – your grammar is not ambiguous therefore you can check the type name later in the AST.

These are the relevant rules (from http://golang.org/ref/spec); they are already correct.

Parameters     = "(" [ ParameterList [ "," ] ] ")" .
ParameterList  = ParameterDecl { "," ParameterDecl } .
ParameterDecl  = [ IdentifierList ] [ "..." ] Type .
IdentifierList = identifier { "," identifier } .

I'll explain them to you:

IdentifierList = identifier { "," identifier } .

The curly braces represent the kleene-closure (In POSIX regular expression notation it's the asterisk). This rule says "an identifier name, optionally followed by a literal comma and an identifier, optionally followed by a literal comma and an identifier, etc… ad infinitum"

ParameterDecl  = [ IdentifierList ] [ "..." ] Type .

The square brackets are nullability; this means that that part may or may not be present. (In POSIX regular expression notation it's the question mark). So you have "Maybe an IdentifierList, followed by maybe an ellipsis, followed by a type.

ParameterList  = ParameterDecl { "," ParameterDecl } .

You can have several ParameterDecl in a list like e.g. func x(a, b int, c, d string).

Parameters     = "(" [ ParameterList [ "," ] ] ")" .

This rules defines that a ParameterList is optional and to be surrounded by parenthesis and may include an optional final comma literal, useful when you write something like:

func x(
    a, b int,
    c, d string, // <- note the final comma
)

The Go grammar is portable and can be parsed by any bottom-up parser with one token of lookahead.


Edit regarding "don't be C": I said this because C is context-sensitive and the way they solve this problem in many (all?) compilers is by wiring the symbol table to the lexer and lexing tokens differently depending on if they are defined as type names or variables. This is a hack and should not be done for unambiguous grammars!