I'm writing my own lexer (and eventually a parser), and so far it recognizes everything I need, with one exception. When I input an identifier such as "character", the lexer outputs a token [ KEYWORD, "char" ] followed by a token [ IDENTIFIER, "acter" ]. I currently try keywords before identifiers, so that something like int, which matches both the keyword rule and the identifier rule, is tokenized as a keyword. The side effect is that any identifier that begins with a keyword gets split in two: one token for the keyword and a second for the rest. I need the whole thing to stay a single IDENTIFIER. If any code is required, I'll be glad to post it.

EDIT: Here is the grammar (no parsing rules yet). Note: it is shortened to keep it to the point. The keywords are listed before the identifiers because they have priority.

KEYWORDS: "if" | "else" | "while" | "for" | "false" | "true" | "break" | "return" | "int" | "float" | "char" | "string" | "bool" | "void" | "null";

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;

INT_LITERAL: [0-9]+;

FLOAT_LITERAL: [0-9]+ '.' [0-9]+;

Make your matching greedy so it always tries to match as many characters as possible. - Morten Jensen
You might consider posting your grammar. Assuming your grammar defines any sequence matching [A-Za-z_][A-Za-z_0-9]* as an identifier, you probably have a bug in the part of your lexical analyzer that collects identifier tokens. - ChuckCottrill

1 Answer

I assume keywords are a subset of identifiers.

You should not rely on the lexer's matching rules to find keywords. Instead, your lexer should only look for identifiers, and do so greedily, i.e. it should match the longest sequence of characters that constitutes an identifier.

When it finds one, check whether the text of the identifier is one of the keywords. If it is, return a KEYWORD token; otherwise return an IDENTIFIER token.
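To make this concrete, here is a minimal sketch of the approach in Python. The question doesn't say which language the lexer is written in, so Python is an assumption, and the names `KEYWORDS`, `TOKEN_RE` and `tokenize` are made up for illustration. The keyword list and token patterns are copied from the grammar in the question; whitespace skipping and the error handling are my own additions.

```python
import re

# Keyword set copied from the grammar in the question.
KEYWORDS = {
    "if", "else", "while", "for", "false", "true", "break", "return",
    "int", "float", "char", "string", "bool", "void", "null",
}

# There is no keyword rule here. FLOAT_LITERAL is listed before
# INT_LITERAL so that "3.14" is not cut short at "3".
# The SKIP rule (whitespace) is an assumption; the question's grammar
# does not show how whitespace is handled.
TOKEN_RE = re.compile(r"""
    (?P<FLOAT_LITERAL>[0-9]+\.[0-9]+)
  | (?P<INT_LITERAL>[0-9]+)
  | (?P<IDENTIFIER>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, lexeme) pairs for `text`."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(
                f"unexpected character {text[pos]!r} at offset {pos}"
            )
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        # The identifier pattern is greedy, so lexeme is the whole word.
        # Only now do we decide whether that word is a keyword.
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens
```

With this, `tokenize("char character")` returns `[("KEYWORD", "char"), ("IDENTIFIER", "character")]`: the `*` quantifier consumes the full word before the keyword lookup happens, so "character" is never split. A set lookup also keeps the keyword check cheap as the keyword list grows.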