7 votes

Suppose you have a language where identifiers might begin with keywords. For example, suppose "case" is a keyword, but "caser" is a valid identifier. Suppose also that the lexer rules can only handle regular expressions. Then it seems that I can't place keyword rules ahead of the identifier rule in the lexer, because this would tokenize "caser" as "case" followed by "r". I also can't place keyword lexing rules after the identifier rule, since the identifier rule would match the keywords, and the keyword rules would never trigger.

So, instead, I could make a keyword_or_identifier rule in the lexer, and have the parser determine if a keyword_or_identifier is a keyword or an identifier. Is this what is normally done?

I realize that "use a different lexer that has lookahead" is an answer (kind of), but I'm also interested in how this is done in a traditional DFA-based lexer, since my current lexer seems to work that way.

1
Why do you think you can't place keyword rules first? That's indeed the "usual solution", and it is not contradictory with constructing a DFA. – rici
I edited the question to clarify this. – BenRI

1 Answer

8 votes

Most lexers, starting with the original lex, match alternatives as follows:

  1. Use the longest match.

  2. If there are two or more alternatives which tie for the longest match, use the first one in the lexer definition.

This allows the following style:

"case"                  { return CASE; }
[[:alpha:]][[:alnum:]]* { return ID; }

If the input is caser, then the second alternative will be used because it's the longest match. If the input is case r, then the first alternative will be used, because both rules match case, and by rule (2) above, the first one wins.

Although this may seem a bit arbitrary, it's mostly consistent with the DFA approach. First of all, a DFA doesn't stop the first time it reaches an accepting state. If it did, then patterns like [[:alpha:]][[:alnum:]]* would be useless, because they enter an accepting state on the first character (assuming it's alphabetic). Instead, DFA-based lexers continue until there are no possible transitions from the current state, and then they back up to the last accepting state. (See below.)

A given DFA state may be accepting because of two different rules, but that's not a problem either; only the first accepting rule is recorded.

To be fair, this is slightly different from the mathematical model of a DFA, which has a transition for every symbol in every state (although many of them may be transitions to a "sink" state), and which matches a complete input depending on whether or not the automaton is in an accepting state when the last symbol of the input is read. The lexer model is slightly different, but can easily be formalized as well.
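One way to make that precise: if the rules have patterns r1, …, rn, then at each step the lexer consumes the longest prefix w of the remaining input such that w ∈ L(rj) for at least one j, and reports the token of the smallest such j. That is exactly "longest match, then first rule".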

The only difficulty in the theoretical model is "back up to the last accepting state". In practice, this is generally done by remembering the state and input position every time an accepting state is reached. This does mean that it may be necessary to rewind the input stream, possibly by an arbitrary amount.
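For concreteness, here is a rough sketch in C of such a driver loop. The next_state and accept_rule tables are placeholders standing in for whatever a generator like flex would actually emit; the point is just the bookkeeping around the last accepting state.

/* Sketch of a table-driven scanner loop with "back up to the last
   accepting state".  next_state[][] and accept_rule[] are placeholders
   that a real lexer generator would fill in. */

#include <stddef.h>

#define NSTATES 64
#define NO_RULE (-1)

static int next_state[NSTATES][256];   /* negative entry means "no transition" */
static int accept_rule[NSTATES];       /* rule number, or NO_RULE              */

/* Scan one token starting at buf[*pos].  Returns the matched rule number,
   or NO_RULE if nothing matched; *pos is advanced past the lexeme. */
static int scan(const char *buf, size_t *pos)
{
    int state = 0;                /* DFA start state                         */
    size_t i = *pos;              /* current input position                  */
    int last_rule = NO_RULE;      /* rule of the last accepting state seen   */
    size_t last_pos = *pos;       /* input position at that accepting state  */

    while (buf[i] != '\0') {
        int next = next_state[state][(unsigned char)buf[i]];
        if (next < 0)
            break;                /* no transition: stop and back up         */
        state = next;
        ++i;
        if (accept_rule[state] != NO_RULE) {
            /* Longest match so far.  The table records only the first rule
               that made this state accepting, which is the tie-break rule. */
            last_rule = accept_rule[state];
            last_pos = i;
        }
    }

    *pos = last_pos;              /* rewind to the last accepting position   */
    return last_rule;
}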

Most languages do not require backup very often, and very few require indefinite backup. Some lexer generators can generate faster code if there are no backup states. (flex will do this if you use -Cf or -CF.)

One common case which leads to indefinite backup is failing to provide an appropriate error return for string literals:

["][^"\n]*["]    { return STRING; }
 /* ... */
.                { return INVALID; }

Here, the first pattern will match a string literal starting with " if there is a matching " on the same line. (I omitted \-escapes for simplicity.) If the string literal is unterminated, the last pattern will match, but the input will need to be rewound to the ". In most cases, it's pointless trying to continue lexical analysis by ignoring an unmatched "; it would make more sense to just ignore the entire remainder of the line. So not only is backing up inefficient, it is also likely to lead to an explosion of false error messages. A better solution might be:

["][^"\n]*["]    { return STRING; } 
["][^"\n]*       { return INVALID_STRING; }

Here, the second alternative can only succeed if the string is unterminated, because if the string is terminated, the first alternative will match one more character. Consequently, it doesn't even matter in which order these alternatives appear, although everyone I know would put them in the same order I did.