Difficulty with my lexical analyzer

Question

I'm trying to write a lexical analyzer for a standard C translation unit, so I've divided the possible tokens into 6 groups; for each group there's a regular expression, which will be converted to a DFA:

Keyword - (will have a symbol table containing "goto", "int"....)
Identifiers - [a-zA-Z][a-zA-Z0-9]*
Numeric Constants - [0-9]+\.?[0-9]*
String Constants - "[EVERY_ASCII_CHARACTER]*"
Special Symbols - (will have a symbol table containing ";", "(", "{"....)
Operators - (will have a symbol table containing "+", "-"....)

My Analyzer's input is a stream of bytes/ASCII characters. My algorithm is the following:

assuming there's a stream of characters, x1...xN
 for i=1; i<=N; i++
    if x1...xI is accepted by one or more of the 6 groups' DFAs
    {
       take the longest-token
       add x1...xI to token-linked-list
       delete x1...xI from input
    }

However, this algorithm will treat every letter it is given as an identifier, since after reading a single character the input is already accepted by the identifier DFA ([a-zA-Z][a-zA-Z0-9]*).

Another problem: for the input "intx;", my algorithm will tokenize the stream into "int", "x", ";", which is of course wrong.

I'm trying to think about a new algorithm, but I keep failing. Any suggestions?

"I'm programming a compiler." -- Uh, so? That's not inconsistent with using flex. But if you must roll your own ... a lot has been written about how to write lexers, and I suggest you read some of it, because your approach sucks, is horridly slow, and yields the wrong results. — Jim Balter
@RonPinkas Gee, Ron, I would certainly want to know that my approach sucks, and I can't comprehend not understanding why. — Jim Balter
@RonPinkas Perhaps you didn't read my comment, which served a lot more purpose than these comments of yours do. — Jim Balter

Jonathan Leffler · Accepted Answer · 2014-10-16T02:29:24

Code your scanner so that it treats identifiers and keywords the same until the reading is finished.

When you have the complete token, look it up in the keyword table, and designate it a keyword if you find it and an identifier if you don't. This deals with the intx problem immediately; the scanner reads intx, and that's not a keyword, so it must be an identifier.

I note that your identifiers don't allow underscores. That's not necessarily a problem, but many languages do allow underscores in identifiers.
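
As a rough illustration of the approach described in this answer, here is a minimal sketch in C. The names scan_word, is_keyword, and token_kind are hypothetical, the keyword table is only a small illustrative subset, and underscores are accepted in identifiers as an assumption (the question's regular expression does not allow them). The scanner consumes the longest run of identifier characters first and only then consults the keyword table.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_KEYWORD };

/* Illustrative subset of the C keyword table, not the full list. */
static const char *const keywords[] = { "goto", "int", "return", "while" };

/* Return 1 if the n-byte word is in the keyword table, 0 otherwise. */
static int is_keyword(const char *word, size_t n)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strlen(keywords[i]) == n && memcmp(keywords[i], word, n) == 0)
            return 1;
    }
    return 0;
}

/*
 * Scan one word starting at src, which must begin with a letter or '_'
 * (allowing '_' is an assumption; drop it to match the question's regex).
 * Stores the word's length in *len and returns its kind.
 */
enum token_kind scan_word(const char *src, size_t *len)
{
    assert(src != NULL && len != NULL);
    assert(isalpha((unsigned char)src[0]) || src[0] == '_');

    /* Keep reading while the next character can continue an identifier. */
    size_t n = 1;
    while (isalnum((unsigned char)src[n]) || src[n] == '_')
        n++;

    *len = n;
    /* Classify only after the whole word has been read. */
    return is_keyword(src, n) ? TOK_KEYWORD : TOK_IDENTIFIER;
}
```

Called on "intx;", scan_word reads all four characters of intx before looking anything up, so the word is classified as an identifier; called on "int x;", it stops at the space and finds int in the table.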
