I'm trying to learn a little more about compiler construction, so I've been toying with flexc++ and bisonc++; for this question I'll refer to flex/bison however.
In bison, one uses the %token
declaration to define token names, for example
%token INTEGER
%token VARIABLE
and so forth. When bison is used on the grammar specification file, a header file y.tab.h
is generated which has some define directives for each token:
#define INTEGER 258
#define VARIABLE 259
Finally, the lexer includes the header file y.tab.h
by returning the right code for each token:
%{
#include "y.tab.h"
%}
%%
[a-z] {
// some code
return VARIABLE;
}
[1-9][0-9]* {
// some code
return INTEGER;
}
So the parser defines the tokens, then the lexer has to use that information to know which integer codes to return for each token.
Is this not totally bizarre? Normally, the compiler pipeline goes lexer -> parser -> code generator. Why on earth should the lexer have to include information from the parser? The lexer should define the tokens, then flex creates a header file with all the integer codes. The parser then includes that header file. These dependencies would reflect the usual order of the compiler pipeline. What am I missing?