0
votes

I am trying to write a LBNF/BNFC grammar for a C-like language. In C there are many possible modifiers that you may or may not write in front of a declaration (like inline, const, volatile and so on).

I am trying to write my grammar to reuse code and make the resulting Haskell AST easy to consume. A grammar for types could look like this:

rule TypeName ::= "bool" | "int" | "double" | "void" | Id ;

Type. Type ::= TypeQualifier TypeName;

ConstModifier.    TypeModifier ::= "const" ;
VolatileModifier. TypeModifier ::= "volatile" ;
NoModifier.       TypeModifier ::= ;

And for a function declaration it might look like this:

Fun. Fun ::= FunModifier Type Id "(" [Param] ")" ";" ;

InlineModifier. FunModifier ::= "inline" ;
NoFunModifier.  FunModifier ::= ;

The problem with this is I'm getting a ton of shift/reduce and sometimes even reduce/reduce conflicts due to these optional prefixes. An alternative grammar that avoids these conflicts could look like this:

NotInlinedFun. Fun ::= Type Id "(" [Param] ")" ";" ;
InlinedFun.    Fun ::= "inline" Type Id "(" [Param] ")" ";" ;

or

NotInlinedFun. Fun ::= FunRest
InlinedFun.    Fun ::= "inline" FunRest;

FunRest.   FunRest ::= Type Id "(" [Param] ")" ";" ;

which leads to a Haskell AST like this:

data Fun = AFun FunRest | BFun FunRest | CFun FunRest
data FunRest = FunRest Type Id [Param]

instead of the more appealing

data Fun = Fun Modifier Type Id [Param]
data Modifier = A | B | C

You can see how this could quickly lead to a combinatorial explosion of rules or a Haskell AST that won't be pleasant to use.

How can I best avoid these conflicts?

1

1 Answers

1
votes

When you see nothing before an int, you don't know whether that nothing is the absence of a variable modifier or the absence of a function modifier, precisely because you don't yet know whether the int refers to a variable or the return value of a function. So if the parser is working only with a single token of lookahead, you must avoid forcing it to make a decision.

Fabricating a non-terminal out of nothing is a form of forcing the parser to decide what kind of nothing is being examined, so that, too, must be avoided. But that's not the only example; had you included static, for example, you would find that trying to classify it as a variable modifier or a function modifier would lead to the same (reduce/reduce) conflict.

But in any case, the real C grammar is more subtle. For example, the following is legal:

static inline const int* extract(int arg);

And so is this:

/* The second const is irrelevant to this discussion. */
volatile const unsigned char* const reg = 0x01A4; 

So a declaration can have a lot of qualifiers, not just zero or one. In some cases, repetition makes a difference:

long long very_wide;

In other cases, it doesn't:

inline inline int f(void);

While these constraints could be expressed as a context-free grammar, I've never seen it done; as you say, the exponential explosion is unmanageable. The actual C grammar, as described in the C standard, does not attempt this feat; it simply allows a declaration to contain an arbitrarily order of possibly repeated declaration-specifiers (see ยง6.7) and then forces the semantic analysis to distinguish between correct and incorrect sequences.