5
votes

Suppose you have a language that allows a production like this: optional optional = 42, where the first "optional" is a keyword and the second "optional" is an identifier.

On one hand, I'd like to have a Lex rule like optional { return OPTIONAL; }, which would later be used in YACC like this, for example:

optional : OPTIONAL identifier '=' expression ;

If I then define identifier as, say:

identifier : OPTIONAL | FIXED32 | FIXED64 | ... /* a couple dozen keywords */
    | IDENTIFIER ;

It just feels wrong. Besides, I would need two kinds of identifier rules: one for contexts where keywords are allowed as identifiers, and another for contexts where they aren't.

Is there an idiomatic way to solve this?

3 Answers

1
votes

Is there an idiomatic way to solve this?

Other than the solution you have already found, no. Semi-reserved keywords are definitely not an expected use case for lex/yacc grammars.

The lemon parser generator has a fallback declaration designed for cases like this, but as far as I know, that useful feature has never been added to bison.

You can use a GLR grammar to avoid having to figure out all the different subsets of identifier. But of course there is a performance penalty.
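A rough sketch of the GLR route in bison follows. The grammar and the token names are made up for illustration (they are not the question's real language), and only the grammar part is shown; yylex and yyerror are left out. The point is that `name` can simply list the keyword, even though that creates shift/reduce conflicts after a leading OPTIONAL:

```yacc
/* Hypothetical grammar: "optional" may be a modifier, a type name or a field name. */
%glr-parser
%token OPTIONAL IDENTIFIER

%%

file  : /* empty */
      | file field
      ;

/* After a leading OPTIONAL, one token of lookahead cannot tell whether it
   was the modifier (first alternative) or a type called "optional"
   (second alternative). */
field : OPTIONAL type name ';'
      | type name ';'
      ;

type  : name ;

name  : IDENTIFIER
      | OPTIONAL
      ;
```

Bison still reports the conflicts when it generates the parser, but the GLR parser follows both possibilities at run time and drops the one that hits a syntax error. For "optional optional x ;" only the reading with OPTIONAL as the modifier survives. This particular grammar is unambiguous, so no %dprec or %merge annotations should be needed.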

1
votes

You've already discovered the most common way of dealing with this in lex/yacc, and, while not pretty, it's not too bad. Normally you call your rule that matches an identifier or (set of) keywords whateverName, and you may have more than one of them, as different contexts may accept different sets of keywords as a name.
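For instance, a minimal sketch with two such rules might look like this. The names field_name and type_name, the NUMBER token and the shape of the field rule are assumptions for illustration only:

```yacc
/* Hypothetical tokens; only the grammar part is shown. */
%token OPTIONAL FIXED32 FIXED64 IDENTIFIER NUMBER

%%

field      : OPTIONAL type_name field_name '=' NUMBER ';'
           | type_name field_name '=' NUMBER ';'
           ;

/* In field-name position every keyword is acceptable as a name. */
field_name : IDENTIFIER
           | OPTIONAL
           | FIXED32
           | FIXED64
           ;

/* In type position OPTIONAL is left out, so a leading OPTIONAL
   can only be the modifier keyword. */
type_name  : IDENTIFIER
           | FIXED32
           | FIXED64
           ;
```

Leaving OPTIONAL out of type_name is what should keep this conflict-free: the parser never has to guess whether a leading OPTIONAL is the modifier or a type.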

Another approach, which may work if your keywords are only recognized as such in easily identifiable places (such as at the start of a line), is to use a lex start state so that the lexer returns a KEYWORD token only when the keyword appears in that context. In any other context, the keyword is just returned as an identifier token. You can even use yacc actions to set the lexer state for somewhat complex contexts, but then you need to be aware of the parser's possible one-token lookahead: an action might not run until after the token following it has already been read, so that token is scanned in the old lexer state.
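A minimal flex sketch of the start-state idea, assuming "optional" is a keyword only as the first token of a statement and that statements end with ';'. The BODY state name and the y.tab.h header are assumptions, and yylval handling is omitted:

```lex
%{
#include "y.tab.h"   /* assumed to define OPTIONAL and IDENTIFIER */
%}
%option noyywrap
%s BODY

%%

<INITIAL>"optional"      { BEGIN(BODY); return OPTIONAL; }
[A-Za-z_][A-Za-z0-9_]*   { BEGIN(BODY); return IDENTIFIER; }
";"                      { BEGIN(INITIAL); return ';'; }
[ \t\r\n]+               { /* skip whitespace, keep the current state */ }
.                        { BEGIN(BODY); return yytext[0]; }
```

The scanner starts in INITIAL, so the first "optional" of a statement comes back as OPTIONAL. After any token it switches to BODY, where the keyword rule is inactive and "optional" falls through to the identifier rule. A ';' switches back to INITIAL for the next statement.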

1
votes

This is a case where the keywords are not reserved. A few programming languages, such as PL/I and FORTRAN, allow this. It's not a lexer problem, because the lexer should always know which IDENTIFIERs are keywords. It's a parser problem. It usually causes too much ambiguity in the language specification, and parsing becomes a nightmare. The grammar would have this:

identifier : keyword | IDENTIFIER ;

keyword : OPTIONAL | FIXED32 | FIXED64 | ... ;

If you have no conflicts in the grammar, then you are OK. If you have conflicts, then you need a more powerful parser generator, such as one that supports LR(k) or GLR parsing.