35 votes

Is it a lexer's job to parse numbers and strings?

This may or may not sound dumb, given the fact that I'm asking whether a lexer should parse input. However, I'm not sure whether that's actually the lexer's job or the parser's job, because in order to lex properly, the lexer needs to parse the string/number in the first place, so it would seem that code would be duplicated if the parser did this too.

Is it indeed the lexer's job? Or should the lexer simply break up a string like 123.456 into the strings 123, ., and 456 and let the parser figure out the rest? Doing this wouldn't be so straightforward with strings...

3
Is this for something like Lex / YACC? – Grady Player
@Grady: Kinda sorta... not really. xD I'm trying to make my own lexer (and hopefully parser) by hand, and I can't figure out which one should do what for parsing numbers and strings. I'm not using any external tools like Lex/YACC, so in that regard, no. – user541686
Intriguing, but still unclear. Do you mean is it the lexer's job to identify the type of number for the parser, beyond identifying the token? In which case, as you say, you have kind of a circular reference (how can it know the format to parse before knowing the type). I think this is why C# (for instance) allows suffixes on floating-point numbers (F and D) to disambiguate floats from doubles for the lexer. My guess is, when the lexer's done, there should be no question about the tokens, but there may be some question about the types. Literal formats have to be crafted to allow this. – harpo
@harpo: No, I'm not referring to the types. (Or am I? I think I got confused by what you meant by "type".) I'm just referring to the fact that, for a lexer to be able to say "here is a string literal", it would need to process all the escape codes and everything. But if all it does is just return a portion of the input, then the parser will have to do the exact same thing again -- which is duplication, isn't it? – user541686
You mean it has to "process" escape sequences by recognizing that they do not terminate the string, but are part of it; and by handing this "literally" to the parser, it then becomes the parser's job to decode them. Alternatively, the lexer could perform the decoding and hand the parser a "ready" string. The latter feels more right to me, for strings, but for numbers that would seem to imply that the lexer is converting input to a ready number for the parser, which does seem to be going out of scope. Just thinking out loud here. – harpo

3 Answers

33 votes

The simple answer is "Yes".

In the abstract, you don't need lexers at all. You could simply write a grammar that uses individual characters as tokens (and in fact that's exactly what scannerless GLR (SGLR) parsers do, but that's a story for another day).

You need lexers because parsers built using characters as primitive elements aren't as efficient as parsers operating on an input stream broken into "tokens", where tokens are the primitive elements of the language you are parsing (whitespace, keywords, identifiers, numbers, operators, strings, comments, ...). [If you don't care about efficiency, you can skip the rest of this answer and go read about SGLR parsers.]

Good lexers typically take sets of regular expressions representing the language elements, and compile them into an efficient finite state machine that can segment the input stream into those language elements quickly. (If you don't want to use a lexer generator, you can code the FSA yourself for simple languages.) Such compiled FSAs execute only a few tens of machine instructions per input character (get character from input buffer, switch on character to new state, decide if token is complete, if not do it again), and can thus be extremely fast.
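As a minimal hand-coded sketch of such an FSA (in C; the function name is my own, and it recognizes only unsigned decimal numbers like 123 or 123.456):

    #include <ctype.h>
    #include <stddef.h>

    /* Return the length of the number token starting at s, or 0 if
       s does not start a number. Each character costs only a cheap
       test-and-advance, as described above. */
    size_t lex_number(const char *s) {
        size_t i = 0;
        while (isdigit((unsigned char)s[i]))        /* integer part */
            i++;
        if (i == 0)
            return 0;                               /* not a number */
        if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
            i++;                                    /* consume '.' */
            while (isdigit((unsigned char)s[i]))    /* fraction part */
                i++;
        }
        return i;
    }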

The output of such a lexer is typically a code representing the language element (or nothing for whitespace, if the parser would ignore it anyway) plus some position information (starts in file foo, line 17, column 3) to enable error reporting.
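A sketch of what such an output record might look like (names are illustrative, not from any particular tool):

    typedef enum { TOK_NUMBER, TOK_STRING, TOK_IDENT, TOK_PLUS /* ... */ } TokenCode;

    typedef struct {
        TokenCode   code;    /* which language element was matched  */
        const char *file;    /* position information so the parser  */
        int         line;    /* can report "starts in file foo,     */
        int         column;  /* line 17 column 3" on errors         */
    } Token;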

One can stop there and have useful lexers. However, it is often useful to add a conversion step that translates a token's character string into the equivalent native machine value, either as the characters are collected or when the token is complete, while the lexer still has knowledge of the specific characters involved. This is used to convert numbers (of varying radixes) in the target language to their native binary equivalent, to convert literal strings containing escape sequences into the actual characters making up the string, and even to look identifier names up in a hash table so that identical identifiers are easily recognized.

The parser typically isn't interested in these converted values, but the steps beyond parsing (semantic analysis, optimization, code generation) need them anyway, so you might as well convert them as you discover them. (You could delay the conversion until the binary value is needed, but in practice you almost always need the value, so delaying doesn't buy very much.)
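As a sketch of that conversion step (in C; assuming the lexer still has the token's characters in a buffer, and using the standard strtod for the numeric case):

    #include <stdlib.h>
    #include <string.h>

    /* Convert a completed number token to its native binary value. */
    double number_value(const char *text, size_t len) {
        char buf[64];                     /* assumes tokens fit */
        if (len >= sizeof buf)
            len = sizeof buf - 1;
        memcpy(buf, text, len);
        buf[len] = '\0';
        return strtod(buf, NULL);         /* handles 123.456, 1e9, ... */
    }

    /* Rewrite a string token's escape sequences in place, so the two
       source characters \ and n become one actual newline character. */
    void unescape(char *s) {
        char *out = s;
        while (*s) {
            if (*s == '\\' && s[1] != '\0') {
                s++;                                  /* skip the backslash */
                switch (*s++) {
                    case 'n': *out++ = '\n'; break;
                    case 't': *out++ = '\t'; break;
                    default:  *out++ = s[-1]; break;  /* \\ , \" , ... */
                }
            } else {
                *out++ = *s++;
            }
        }
        *out = '\0';
    }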

0 votes

I assume you want to treat "123.456" as a whole value, in which case you will pass it to the parser wholesale, unless you need to encode it somehow, like

    struct DecimalRep {
        double mantissa;
        double exponent;
    };

but I guess it all depends on what the parser expects.
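For illustration, here is a sketch of filling that struct from the token text, reading 123.456 as 123456 × 10^-3 (my interpretation of the fields, not necessarily the author's):

    /* decode("123.456") yields mantissa 123456.0, exponent -3.0 */
    struct DecimalRep decode(const char *s) {
        struct DecimalRep r = { 0.0, 0.0 };
        int seen_dot = 0;
        for (; *s; s++) {
            if (*s == '.') { seen_dot = 1; continue; }
            r.mantissa = r.mantissa * 10.0 + (*s - '0');
            if (seen_dot)
                r.exponent -= 1.0;   /* one power of ten per fraction digit */
        }
        return r;
    }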

0 votes

A lexer essentially identifies TOKENs in the input. In this case, the lexer will likely "match" the number as a floating-point-number TOKEN. The parser essentially processes the tokens and does the syntax analysis.
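A sketch of that division of labor (token names are my own): for an input like x = 123.456, the lexer emits token codes and the parser only checks their arrangement:

    typedef enum { TOK_IDENT, TOK_ASSIGN, TOK_FLOAT } Tok;

    /* Parser rule sketch: assignment -> IDENT '=' FLOAT.
       The parser never re-reads the characters '1', '2', '3', '.';
       it sees only the token codes the lexer produced. */
    int is_assignment(const Tok t[3]) {
        return t[0] == TOK_IDENT && t[1] == TOK_ASSIGN && t[2] == TOK_FLOAT;
    }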