5
votes

I found the following EBNF on wikipedia, describing EBNF:

letter = "A" | "B" | "C" | "D" | "E" | "F" | "G"
   | "H" | "I" | "J" | "K" | "L" | "M" | "N"
   | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
   | "V" | "W" | "X" | "Y" | "Z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
symbol = "[" | "]" | "{" | "}" | "(" | ")" | "<" | ">"
   | "'" | '"' | "=" | "|" | "." | "," | ";" ;
character = letter | digit | symbol | "_" ;

identifier = letter , { letter | digit | "_" } ;
terminal = "'" , character , { character } , "'" 
     | '"' , character , { character } , '"' ;

lhs = identifier ;
rhs = identifier
 | terminal
 | "[" , rhs , "]"
 | "{" , rhs , "}"
 | "(" , rhs , ")"
 | rhs , "|" , rhs
 | rhs , "," , rhs ;

rule = lhs , "=" , rhs , ";" ;
grammar = { rule } ;

Now, because of my limited knowledge on parsers and grammars, I don't know if this is an LL(1) grammar. I tried to write a parser for it, but it fails when trying to read rhs, which reads itself again, which reads itself again, which oh, you got it...

  • Is it an LL(1) grammar?
  • If not, how to turn it into one (possible?)?
2

2 Answers

12
votes

The quoted Wikipedia extract is not a correct EBNF grammar for EBNF. It's also not left-parseable: indeed, it is ambiguous, so it's not unambiguously parseable at all.

In general, the terms LL(k) and LR(k) (and many other such terms) apply to Context-Free Grammars (CFGs) (and, by extension, the languages generated by those grammars). EBNF is not a formalism for describing CFGs. It is designed to be a formalism to describe context-free languages and therefore it should be possible to create a CFG from a given EBNF grammar (but see Note 1), but there is not a direct correspondence between an EBNF syntax rule and a single production in a CFG.

That said, you can usually directly create a CFG by using some standard transformations. For example:

{ ... }

can be substituted with the generated non-terminal M'', with the addition of the following productions: (ε is the empty string)

M'  → ...
M'' → ε
M'' → M' M''

The above transformation does not introduce left-recursion, so it does not artificially make the original grammar non-LL(1).

The most important error in the cited grammar [Note 2] is the ambigous EBNF rule:

rhs = identifier
    | terminal
    | "[" , rhs , "]"
    | "{" , rhs , "}"
    | "(" , rhs , ")"
    | rhs , "|" , rhs
    | rhs , "," , rhs
    ;

It's also left-recursive, so it will not correspond to an LL(1) CFG. But more importantly, it does not indicate either the associativity or the precedence of the | and , operators. (Semantically, these operators do not have a defined associativity, but the syntax should still specify one; otherwise, it is impossible to unambiguously create a parse tree. The precedence between the two operators is important semantically.)

A better set of rules would be:

primary = identifier
        | terminal
        | "[" , rhs , "]"
        | "{" , rhs , "}"
        | "(" , rhs , ")"
        ;
factor  = primary , { "|" , primary } ;
rhs     = factor ,  { "," , factor } ;

This is still an oversimplification, but it covers a large number of use cases. It's neither ambiguous nor left-recursive. [3]


Notes

  1. Syntactic constraints specified in comments may not be easy to translate into CFGs, though. For example, the ISO standard EBNF for EBNF defines the non-terminal "syntactic exception" as follows:

    syntactic exception =
       ? a syntactic-factor that could be replaced
         by a syntactic-factor containing no
         meta-identifiers
       ?


    The intention of the above text is to restrict an exception to be a regular language. That's important, since the set difference between two context-free languages is not necessarily context-free, while the difference between a context-free language and a regular language is provably context-free. Ironically, the "special sequence" describing this restriction cannot be expressed as a context-free grammar because it depends on the definition of meta-identifiers. (If it had said "a syntactic-factor containing no meta-identifiers" it would be easy to write without resorting to a special sequence, but clearly that was not the intent.)

  2. There is another important error in the Wikipedia excerpt. It defines both types of quoted strings as having the same body, but that's not correct; a double-quoted string cannot include double-quote characters, and a single-quoted string cannot contain single-quote characters. So the use of the identifier character in both of those definitions is incorrect.

  3. The formal EBNF grammar allows primary to be empty. I left that out, because it's not usually needed.

2
votes

In short, no, your grammar is not LL(1).

The first reason is the left-recursion of rhs you already discovered. I assume, you wrote a recursive descent parser (or something else that bases on LL(1) grammars). Such a parser is not able to handle left-recursive rules since they cause a special case of a so-called FIRST/FIRST conflict (cf. 1).

To tackle this problem and answer the second part of your question, you can left-factor your grammar and replace left-recursion as shown in 2.