When is lexing enough, when do you need EBNF?
EBNF really doesn't add much to the power of grammars. It's just a convenience / shortcut notation / "syntactic sugar" over plain BNF (Backus-Naur Form) grammar rules. For example, the EBNF alternative:
S --> A | B
you can achieve in plain BNF by just listing each alternative as a separate production:
S --> A // `S` can be `A`,
S --> B // or it can be `B`.
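If it helps to see what such alternatives mean operationally, here is a minimal recursive-descent sketch in Python (the function names and the choice of 'a'/'b' as terminals are just my illustration, nothing standard): try the first alternative, and if it fails, try the other.

def parse_A(text, pos):
    # For illustration, let `A` be the single terminal 'a'.
    return pos + 1 if pos < len(text) and text[pos] == 'a' else None

def parse_B(text, pos):
    # And let `B` be the single terminal 'b'.
    return pos + 1 if pos < len(text) and text[pos] == 'b' else None

def parse_S(text, pos):
    # S --> A | B : try `A` first, fall back to `B`; return new position or None.
    result = parse_A(text, pos)
    return result if result is not None else parse_B(text, pos)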
The optional element from EBNF:
S --> X?
you can achieve in plain BNF by using a nullable production, that is, one which can be replaced by an empty string (denoted here by an empty right-hand side; other notations use epsilon, lambda, or a crossed circle):
S --> B // `S` can be `B`,
B --> X // and `B` can be just `X`,
B --> // or it can be empty.
A production in the form of the last `B` above is called an "erasure", because it can erase whatever it stands for in other productions (producing an empty string instead of something else).
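In the same sketchy recursive-descent style, a nullable `B` just means the parsing function is allowed to succeed without consuming anything (again, taking `X` to be the terminal 'x' purely for illustration):

def parse_B(text, pos):
    # B --> X : consume an 'x' if one is there...
    if pos < len(text) and text[pos] == 'x':
        return pos + 1
    # B -->   : ...otherwise succeed anyway, consuming nothing (the erasure).
    return pos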
Zero-or-more repetition from EBNF:
S --> A*
you can obtain by using a recursive production, that is, one which refers to itself somewhere on its right-hand side. It can be done in two ways. The first one is left recursion (which usually should be avoided, because top-down recursive-descent parsers cannot parse it):
S --> S A // `S` is just itself ended with `A` (which can be done many times),
S --> // or it can begin with empty-string, which stops the recursion.
Knowing that it generates (ultimately) an empty string followed by zero or more `A`s, the same language (though with different derivations) can be expressed using right recursion:
S --> A S // `S` can be `A` followed by itself (which can be done many times),
S --> // or it can be just empty-string end, which stops the recursion.
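The warning about left recursion is easy to see in code. Translated naively into a recursive-descent parser, `S --> S A` calls itself before consuming any input, so it never terminates; the right-recursive version consumes an `A` first, so it is fine. A rough Python sketch, again taking `A` to be the terminal 'a':

# Left recursion, translated naively -- this never terminates, because
# parse_S calls itself without consuming anything first:
#
#   def parse_S(text, pos):
#       pos = parse_S(text, pos)   # S --> S A : recurse first... forever
#       ...
#
# Right recursion is unproblematic, because an 'a' is consumed before recursing:
def parse_S(text, pos=0):
    if pos < len(text) and text[pos] == 'a':   # S --> A S
        return parse_S(text, pos + 1)          # consume one 'a', then recurse
    return pos                                 # S -->  (empty: stop recursing)

print(parse_S("aaa"))   # 3 -- it consumed zero or more 'a's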
And when it comes to `+` for one-or-more repetition from EBNF:
S --> A+
it can be done by factoring out one `A` and using `*` as before:
S --> A A*
which you can express in plain BNF like this (I use right recursion here; try to figure out the left-recursive version yourself as an exercise):
S --> A S // `S` can be one `A` followed by `S` (which stands for more `A`s),
S --> A // or it could be just one single `A`.
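If you prefer to check that equivalence with regular expressions instead of grammar rules: `A+` and `A A*` match exactly the same strings. A quick Python sanity check, with `A` as the literal 'a':

import re

for s in ["", "a", "aaa"]:
    print(repr(s), bool(re.fullmatch(r"a+", s)), bool(re.fullmatch(r"aa*", s)))
# '' False False
# 'a' True True
# 'aaa' True True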
Knowing that, you can now probably recognize a grammar for a regular expression (that is, a regular grammar) as one which can be expressed in a single EBNF production consisting only of terminal symbols. More generally, you can recognize regular grammars when you see productions similar to these:
A --> // Empty (nullable) production (AKA erasure).
B --> x // Single terminal symbol.
C --> y D // Simple state change from `C` to `D` when seeing input `y`.
E --> F z // Simple state change from `E` to `F` when seeing input `z`.
G --> G u // Left recursion.
H --> v H // Right recursion.
That is, using only empty strings, terminal symbols, simple non-terminals for substitutions and state changes, and using recursion only to achieve repetition (iteration, which is just linear recursion - the kind which doesn't branch tree-like). If there's nothing more advanced than these (and you don't mix left and right recursion through the same chain of rules, which could sneak nesting back in), you can be sure it's a regular syntax and a lexer alone is enough for it.
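That is also why a typical hand-rolled lexer is little more than a pile of regular expressions tried against the input in order. A minimal Python sketch (the token names and patterns here are invented for illustration, not taken from any particular language):

import re

TOKEN_SPEC = [                       # each token class is a purely regular pattern:
    ("NUMBER", r"\d+"),              # no nesting, no counting, no recursion
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    for match in MASTER.finditer(text):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("x = 42 + y")))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NAME', 'y')]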
But when your syntax uses recursion in a non-trivial way, to produce tree-like, self-similar, nested structures, like the following one:
S --> a S b // `S` can be itself "parenthesized" by `a` and `b` on both sides.
S --> // or it could be (ultimately) empty, which ends recursion.
then you can easily see that this cannot be done with a regular expression, because you cannot resolve it into one single EBNF production in any way; you'll end up substituting for `S` indefinitely, which will always add another `a` and `b` on both sides. Lexers (more specifically: the finite state automata used by lexers) cannot count to an arbitrary number (they are finite, remember?), so they don't know how many `a`s there were in order to match them evenly with as many `b`s. Grammars like this are called context-free grammars (at the very least), and they require a parser.
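A parser handles this easily, because recursion gives it an implicit counter: the call stack. A minimal recursive-descent sketch in Python for the grammar above, with `a` and `b` taken as literal characters:

def parse_S(text, pos=0):
    # S --> a S b | (empty). Returns the position after the match, or None.
    if pos < len(text) and text[pos] == 'a':            # S --> a S b
        inner = parse_S(text, pos + 1)                  # parse the nested S
        if inner is not None and inner < len(text) and text[inner] == 'b':
            return inner + 1                            # consume the matching 'b'
        return None                                     # no 'b' to pair with the 'a'
    return pos                                          # S -->  (empty)

def accepts(text):
    return parse_S(text) == len(text)

print(accepts("aabb"))   # True
print(accepts("aab"))    # False -- one 'b' short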
There are well-known techniques for parsing context-free grammars, so they are widely used for describing the syntax of programming languages. But there's more. Sometimes a more general grammar is needed -- when you have more things to count at the same time, independently. For example, when you want to describe a language where one can use round parentheses and square brackets interleaved, but they have to be paired up correctly with their own kind (brackets with brackets, parentheses with parentheses). This kind of grammar is called context-sensitive. You can recognize it by the fact that it has more than one symbol on the left-hand side (before the arrow). For example:
A R B --> A S B
You can think of these additional symbols on the left as a "context" for applying the rule. There could be some preconditions, postconditions, etc. For example, the above rule will rewrite `R` into `S`, but only when it's in between `A` and `B`, leaving those `A` and `B` themselves unchanged. This kind of syntax is really hard to parse, because in general it needs at least a linear bounded automaton (and unrestricted grammars need a full-blown Turing machine). That's a whole other story, so I'll end here, with just a small illustration of the "counting two things at once" idea below.
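One way to read that brackets example as a plain program (the interpretation and the code are mine, not anything standard): round and square brackets may cross each other, but each kind has to balance on its own, which takes two independent counters. This is a direct procedural check, not a grammar-based parser.

def balanced_interleaved(text):
    round_depth = square_depth = 0
    for ch in text:
        if ch == '(':
            round_depth += 1
        elif ch == ')':
            round_depth -= 1
        elif ch == '[':
            square_depth += 1
        elif ch == ']':
            square_depth -= 1
        if round_depth < 0 or square_depth < 0:
            return False                     # a closer appeared before its opener
    return round_depth == 0 and square_depth == 0

print(balanced_interleaved("([)]"))   # True  -- crossing is allowed here
print(balanced_interleaved("(()["))   # False -- '(' and '[' left open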