I'm currently writing my own lexer (and eventually a parser), and so far everything works fine: I can recognize everything I need. Recently, though, I ran into a small problem. When I input an identifier such as "character", the lexer outputs a token [ KEYWORD, "char" ] and another token [ IDENTIFIER, "acter" ]. The way I currently lex the input is to look for keywords before identifiers, so that something like int, which is valid both as a keyword and as an identifier, gets recognized as a keyword first. But when an identifier begins with a keyword, the lexer splits it in two: one part for the keyword and a second part for the identifier. I need it to stay a single IDENTIFIER. If any code is required, I'll be glad to post it.
EDIT: Here is the grammar (no parsing rules yet). Note: it is shortened to keep it to the point. My keywords come before the identifiers, so they have priority.
KEYWORDS: "if" | "else" | "while" | "for" | "false" | "true" | "break" | "return" | "int" | "float" | "char" | "string" | "bool" | "void" | "null";
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
INT_LITERAL: [0-9]+;
FLOAT_LITERAL: [0-9]+ '.' [0-9]+;
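To make the behaviour easier to reproduce, here is a minimal Python sketch. It is not my actual lexer. It only assumes the same strategy as described above: the rules are tried in grammar order and the first one that matches at the current position wins. The names `RULES` and `lex_first_match` are made up for this illustration.

```python
import re

# Keyword list copied from the shortened grammar above.
KEYWORDS = ["if", "else", "while", "for", "false", "true", "break", "return",
            "int", "float", "char", "string", "bool", "void", "null"]

# Hypothetical rule table: same order as the grammar, so keywords have priority.
RULES = [
    ("KEYWORD", re.compile("|".join(KEYWORDS))),
    ("IDENTIFIER", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
    ("INT_LITERAL", re.compile(r"[0-9]+")),
    ("FLOAT_LITERAL", re.compile(r"[0-9]+\.[0-9]+")),
]


def lex_first_match(text):
    """Tokenize by trying each rule in order and taking the first match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # skip whitespace between tokens
            continue
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched at this position.
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens


print(lex_first_match("int"))        # [('KEYWORD', 'int')]
print(lex_first_match("character"))  # [('KEYWORD', 'char'), ('IDENTIFIER', 'acter')]
```

The first call is what I want. The second shows the problem: the keyword rule matches the prefix "char" and the rest is lexed as a separate identifier, whereas I would like a single [ IDENTIFIER, "character" ] token.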