Calculate first, follow and predict from EBNF notation

Question

Grammatical rules are defined as:

an integer literal is a sequence of digits;
a boolean literal is one of true or false;
a keyword is one of if, while, or the boolean literals;
a variable is a string that starts with a letter and is followed by letters or digits, and is not a keyword;
an operator is one of <= >= == != && || = + - * < >
punctuation is one of the ( ) { } , ; characters.

Based on the description I wrote out grammar in EBNF notation as fallows:

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
int literal = digit {digit} ;
bool = "true" | "false" ;
keyword = "if" | "while" | bool ;
letter =  "A" | "B" | "C" | "D" | "E" | "F" | "G"  | "H" | "I" | "J" | "K" | "L" | "M" | "N"  | "O" | "P" |   
               "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" |
               "h" |   "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | 
               "y" | "z" ;
variable = (letter {digit | letter}) -keyword ;
operator = "<=" |">=" | "==" | "!=" | "&&" | "||" | "=" | "+" | "-" | "*" | "<" |  ">" | "!" ;
punctuation = "(" | ")" | " {" | " }" | " , " | " ; " ;

Now i want to calculate FIRST, FOLLOW and PREDICT sets but I'm not sure how to do it out of EBNF notation. Should I first change it to Chomsky normal form? Is so then how? Would that be right?

DIGIT -> 0 1 2 3 ...
INT -> DIGIT | DIGIT DIGIT
BOOL -> true false 
KEYWORD -> if while BOOL
LETTER -> A B C D ...
VARIABLE -> LETTER | LETTER DIGIT | LETTER LETTER

rici rici · Accepted Answer · 2018-03-03T05:03:27

First and follow are pretty straight-forward, even with EBNF. In this case, they are even easier, since you have no nullable non-terminals. (You need to watch out for repetition groups, since the repetition count can be 0. If you have:

... A { X ... } Y ...

then FOLLOW(A) must include both FIRST(X) and FIRST(Y). And if you have

C -> A { X }

then FOLLOW(A) must include FOLLOW(C).

None of this should be complicated if you're doing the computation by hand. For an automated solution, I would probably unroll the repetition operators into unextended BNF by creating new non-terminals, but you could do the computation directly on the EBNF as well.

The one wrinkle is your use of the set difference operator -, in

variable = (letter {digit | letter}) - keyword ;

In this particular case, it does not create any difficulties, but the general solution is tricky. In fact, since there is no guarantee that the difference between two context-free languages is context-free, it will not really be possible to find a truly general solution.

Predict sets are another story. Indeed, I'm not even 100% sure what a predict set would be for EBNF, since you need to be able to predict repetition of a subpattern, not just derivations. Again, expanding to BNF might help, but it can happen that the expansion creates a predict conflict which didn't exist in the original grammar.

The grammar you present is incomplete, so I don't know how useful computing LL(1) sets will be. I suppose that it is intended to be just the lexical part of the grammar, but really there is a reason why lexical analysis is usually done with regular expressions rather than context-free parsing.

Several reasons, really: aside from the fact that lexical analysis usually involves reasonably readable regular expressions, there is also the important fact that lexical analysis does not usually involve parsing the internal structure of a token. That lets you choose to simply recognize a repeated element rather than worrying about whether the parse tree for the repetition should be left- or right-leaning.

The key insight about computing FIRST and FOLLOW sets is that they mean just what their names indicate. The FIRST set of a non-terminal is precisely the set of tokens which can begin a complete derivation from the non-terminal; similarly, the FOLLOW set is precisely the set of tokens which might immediately follow the non-terminal during a derivation from the start symbol. In many simple grammars, these sets can be computed by inspection; that certainly should be the case for your grammar, at least for the FIRST sets.

The fact that you have no start symbol here is another indication that you are probably not solving the right problem; without a start symbol, there is no meaningful definition of FOLLOW.

If you are trying to do lexical analysis, you might be able to get away with:

start -> { token }
token -> int literal | keyword | identifier | ...

Although to be formally correct, you'd also need to handle "ignored tokens" such as comments and whitespace.

Calculate first, follow and predict from EBNF notation

1 Answers