2
votes

I am building a compiler and in lexical analyzer phase:

Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. We have supposed that this method is in use in Fig. 3.14. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id. The function getToken examines the symbol-table entry for the lexeme found, and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.
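The scheme described above can be sketched as follows (a minimal sketch; the token names and the table being a `std::unordered_map` are my assumptions, not from the text):

```cpp
#include <string>
#include <unordered_map>

// Token codes (illustrative names, not from the original text).
enum Token { TOK_ID = 0, TOK_IF, TOK_WHILE, TOK_RETURN };

struct SymbolEntry {
    std::string lexeme;
    Token token;  // TOK_ID for ordinary identifiers, else a keyword token
};

std::unordered_map<std::string, SymbolEntry> symtab;

// Pre-install the reserved words before lexing starts.
void initReserved() {
    symtab["if"]     = {"if",     TOK_IF};
    symtab["while"]  = {"while",  TOK_WHILE};
    symtab["return"] = {"return", TOK_RETURN};
}

// installID: insert the lexeme if absent, return a pointer to its entry.
SymbolEntry* installID(const std::string& lexeme) {
    auto [it, inserted] = symtab.try_emplace(lexeme, SymbolEntry{lexeme, TOK_ID});
    return &it->second;
}

// getToken: ask the entry which token it represents.
Token getToken(const SymbolEntry* e) { return e->token; }
```

With a hashed table, each install/lookup is average-case O(1), which already answers the efficiency worry below.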

But now, every time I recognize a keyword or identifier, I will have to search the entire symbol table, comparing against up to n entries per lookup. Won't that be too inefficient? What else can I do?

Kindly help.

3
For starters, don't ever do a linear search of a table that can be bigger than 3 elements. - Ira Baxter
Wow, 3 elements is too strict a rule. I believe it depends on the language and the table implementation. - Odobenus Rosmarus
It might, but if you stick to this rule you likely won't get hurt. - Ira Baxter

3 Answers

3
votes

If you build a finite state automaton to identify lexemes, then its terminal states should correspond to language lexemes.

You can leave keywords out of the FSA, and you'll end up with only a single terminal state for strings that look like identifiers. This is a common implementation when coding the FSA by hand, and it gives you the problem you have now.

As a practical matter, no matter what you do with keywords, you will want an extremely fast identifier lookup for the symbol table, which pretty much suggests you need a hashing solution. If you have this, then you can do a lookup quickly and check your "it must be a keyword" bit. There are plenty of good hash schemes in existence; as usual, the Wikipedia article on hash functions is a pretty good place to start. This is a practical solution; I use it in my PARLANSE compiler (see my bio), which processes million-line files in a few tens of seconds.

This isn't really the fastest solution, though. It is better to include the keywords in the FSA (this tends to encourage the use of a lexer generator, because adding all the keywords to a manually coded FSA is inconvenient, though not hard). If you do that, and you have keywords that look like identifiers, e.g., goto, there will be terminal states that in effect indicate you have recognized an identifier that happens to be spelled as a specific keyword.

How you interpret that end state is up to you. One obvious choice is that such end states indicate you have found a keyword. No hash table lookup required.
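One way to approximate "keywords baked into the automaton" in hand-written code is a trie over the keyword set, walked after (or during) the identifier scan; landing on a terminal state means the lexeme is a keyword. This is only a sketch of the idea (token numbers are made up; a real lexer generator compiles the keywords directly into the DFA):

```cpp
#include <array>
#include <string>
#include <vector>

// Trie states; state 0 is the start. A non-negative `token` marks a
// terminal state that spells a complete keyword.
struct TrieState {
    std::array<int, 26> next{};  // transitions on 'a'..'z'; 0 = no edge
    int token = -1;              // -1 = not a keyword end state
};

std::vector<TrieState> trie(1);

void addKeyword(const std::string& kw, int token) {
    int s = 0;
    for (char c : kw) {
        int idx = c - 'a';
        if (trie[s].next[idx] == 0) {       // allocate a fresh state
            trie[s].next[idx] = (int)trie.size();
            trie.emplace_back();
        }
        s = trie[s].next[idx];
    }
    trie[s].token = token;
}

// After scanning an identifier, decide whether it is a keyword by
// walking the trie; -1 means "ordinary identifier".
int classify(const std::string& lexeme) {
    int s = 0;
    for (char c : lexeme) {
        if (c < 'a' || c > 'z') return -1;  // keywords here are lowercase-only
        s = trie[s].next[c - 'a'];
        if (s == 0) return -1;              // fell off the automaton
    }
    return trie[s].token;
}
```

The walk costs O(length of the lexeme) with no hashing and no table scan, which is exactly what the combined DFA gives you for free.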

0
votes

You can use a hash table for the list of keywords. That makes each lookup O(1) on average.
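A dedicated keyword table, kept separate from the symbol table, can be as small as this (a sketch; the token codes are illustrative):

```cpp
#include <string>
#include <unordered_map>

// Keyword table, separate from the symbol table.
// Average-case O(1) lookup per scanned identifier.
const std::unordered_map<std::string, int> kKeywords = {
    {"if", 1}, {"else", 2}, {"while", 3}, {"return", 4},
};

// Returns the keyword's token code, or 0 to mean "ordinary identifier".
int keywordToken(const std::string& lexeme) {
    auto it = kKeywords.find(lexeme);
    return it == kKeywords.end() ? 0 : it->second;
}
```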

0
votes

You could use a perfect hash, like one generated with gperf.
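A gperf input file for the keyword set might look like this (a sketch; the token names are made up for illustration):

```
%{
/* keywords.gperf -- input for GNU gperf */
#include <string.h>
enum { TOK_ID, TOK_IF, TOK_WHILE, TOK_RETURN };
%}
struct keyword { const char *name; int token; };
%%
if,     TOK_IF
while,  TOK_WHILE
return, TOK_RETURN
```

Running `gperf -t keywords.gperf` emits a C function `in_word_set(str, len)` that returns a pointer to the matching `struct keyword`, or a null pointer for an ordinary identifier, using a collision-free hash computed at generation time.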