6 votes

I want to understand something about implementing a lexer, and I don't want to use scanner generators. From what I have read, I identify the specification of the language with regular expressions, one for each kind of token. Then I should make one big regular expression ORing together all of the token expressions, right?! Then create an NFA, and then a DFA, from this big regular expression, right?! If so, then when a word gets matched by the final DFA, how will I know which token this word represents?!

2
once a word has been matched, you know where you started, and you know where you ended, right? – wildplasser

2 Answers

4 votes

What you're describing here is implementing a generated scanner by hand. That's not how you do it. Just write a loop containing a large switch statement whose cases are the possible initial characters of each token type, where each case is a loop that consumes the rest of the token and returns its type. The whitespace case is identical except that it doesn't return anything. The case for identifiers also needs to look up a keyword table.
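A minimal sketch of that shape in Python (the token names and the keyword set here are made up for illustration; if/elif chains stand in for the switch statement):

KEYWORDS = {'if', 'while', 'return'}           # hypothetical keyword table

def next_token(src, i):
    # Whitespace case: consume it but do not return a token.
    while i < len(src) and src[i].isspace():
        i += 1
    if i == len(src):
        return None, i
    c = src[i]
    start = i
    if c.isalpha() or c == '_':                # identifier / keyword case
        while i < len(src) and (src[i].isalnum() or src[i] == '_'):
            i += 1
        word = src[start:i]
        return ('KEYWORD' if word in KEYWORDS else 'IDENT', word), i
    if c.isdigit():                            # integer case
        while i < len(src) and src[i].isdigit():
            i += 1
        return ('INT', src[start:i]), i
    return ('PUNCT', c), i + 1                 # any other single character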

4 votes

FSM-based Lexer

To write a finite state machine (FSM) lexer by hand, you loop over the characters in your input and process them with two levels of switch statements:

  • the outer switch statement is switching on the state;
  • the inner switch statement is switching on the character that was read.

This is a direct implementation of a finite state machine processing the tokens. The downside is that it can get complex to maintain, especially if many states are needed per token. The upside is that it is easy to refactor into other FSM representations (e.g. using transition tables instead of switch statements).
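For example, a hand-rolled FSM with that two-level structure might look like this (a sketch; the states and token names are invented, and if/elif chains stand in for the switch statements):

START, IN_IDENT, IN_INT = range(3)             # hypothetical states

def lex_one(src, i):
    state, start = START, i
    while i < len(src):
        c = src[i]
        if state == START:                     # outer dispatch: the current state
            if c.isspace():                    # inner dispatch: the character read
                start = i + 1                  # skip whitespace, stay in START
            elif c.isalpha() or c == '_':
                state = IN_IDENT
            elif c.isdigit():
                state = IN_INT
            else:
                return ('PUNCT', c), i + 1
        elif state == IN_IDENT:
            if not (c.isalnum() or c == '_'):
                break                          # identifier ended before c
        else:  # state == IN_INT
            if not c.isdigit():
                break                          # integer ended before c
        i += 1
    if state == IN_IDENT:
        return ('IDENT', src[start:i]), i
    if state == IN_INT:
        return ('INT', src[start:i]), i
    return None, i                             # nothing but whitespace left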

Transition tables are equivalent to the switch statements: the rows define the states, the columns define the input values, and the cells give the next state to transition to (with something like -1 signalling that processing should stop). With this approach, the end state determines the token type: you keep a token_type tokens[N_STATES]; array, and token = tokens[current_state] gives you the token.
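A table-driven sketch of the same machine (the states, character classes and token names are again invented):

START, IDENT, INT = 0, 1, 2                    # states (table rows)
STOP = -1                                      # signal: stop processing
LETTER, DIGIT, OTHER = 0, 1, 2                 # character classes (table columns)

def char_class(c):
    if c.isalpha() or c == '_':
        return LETTER
    if c.isdigit():
        return DIGIT
    return OTHER

transitions = [
    # LETTER  DIGIT  OTHER
    [ IDENT,  INT,   STOP ],                   # from START
    [ IDENT,  IDENT, STOP ],                   # from IDENT
    [ STOP,   INT,   STOP ],                   # from INT
]

tokens = [None, 'IDENT', 'INT']                # token_type tokens[N_STATES]

def lex_one(src, i):
    state, start = START, i
    while i < len(src):
        nxt = transitions[state][char_class(src[i])]
        if nxt == STOP:
            break
        state, i = nxt, i + 1
    return tokens[state], src[start:i], i      # end state gives the token type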

Non-FSM Lexer

The alternative approach is to switch on the first character and then read the rest of the characters in that token as part of that case statement. This can be easier to read and simpler to write.

You can also split the characters into different classes (e.g. digits, letters, minus and less-than) defined by a 256-item lookup table, as in the sketch below. This can simplify the case statements.
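For instance, a 256-item class table, and a first-character dispatch that branches on the class instead of enumerating individual characters, might look like this (a sketch; the classes and token names are made up):

import string

LETTER, DIGIT, MINUS, LESS, OTHER = range(5)   # hypothetical character classes

char_class = [OTHER] * 256
for ch in string.ascii_letters + '_':
    char_class[ord(ch)] = LETTER
for ch in string.digits:
    char_class[ord(ch)] = DIGIT
char_class[ord('-')] = MINUS
char_class[ord('<')] = LESS

def lex_one(src, i):
    cls = char_class[ord(src[i])]
    start = i
    if cls == LETTER:                          # read the rest of the identifier
        while i < len(src) and char_class[ord(src[i])] in (LETTER, DIGIT):
            i += 1
        return ('IDENT', src[start:i]), i
    if cls == DIGIT:                           # read the rest of the number
        while i < len(src) and char_class[ord(src[i])] == DIGIT:
            i += 1
        return ('INT', src[start:i]), i
    return ('PUNCT', src[i]), i + 1            # single-character token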

Regular Expressions

Using one big regular expression is problematic, as you noted, because you cannot recover the token type from the match. An approach here is to keep a list of regular expressions, one per token, each associated with its token type. For example, in Python:

import re

WhiteSpace, Identifier, Integer = range(3)  # token type tags

_tokens = [
    (re.compile(r'\s+'), WhiteSpace),
    (re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*'), Identifier),
    (re.compile(r'[0-9]+'), Integer),
]

You will need to order these correctly (e.g. placing keyword matching before identifiers).
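A driver loop for such a list might look like this (a sketch reusing the _tokens and WhiteSpace names above; it tries the patterns in order and takes the first one that matches at the current position, which is why the ordering matters):

def tokenize(src):
    i = 0
    while i < len(src):
        for pattern, token_type in _tokens:
            m = pattern.match(src, i)          # anchored match at position i
            if m:
                if token_type != WhiteSpace:   # skip whitespace tokens
                    yield token_type, m.group()
                i = m.end()
                break
        else:
            raise SyntaxError('unexpected character %r at %d' % (src[i], i))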