lex/yacc: why do lexers have to include a parser's header file?

Question

I'm trying to learn a little more about compiler construction, so I've been toying with flexc++ and bisonc++; for this question I'll refer to flex/bison however.

In bison, one uses the %token declaration to define token names, for example

%token INTEGER
%token VARIABLE

and so forth. When bison is used on the grammar specification file, a header file y.tab.h is generated which has some define directives for each token:

#define INTEGER 258
#define VARIABLE 259

Finally, the lexer includes the header file y.tab.h by returning the right code for each token:

%{
#include "y.tab.h"
%}

%%

[a-z]   { 
            // some code
            return VARIABLE;
        }

[1-9][0-9]* {
            // some code
            return INTEGER;
        }

So the parser defines the tokens, then the lexer has to use that information to know which integer codes to return for each token.

Is this not totally bizarre? Normally, the compiler pipeline goes lexer -> parser -> code generator. Why on earth should the lexer have to include information from the parser? The lexer should define the tokens, then flex creates a header file with all the integer codes. The parser then includes that header file. These dependencies would reflect the usual order of the compiler pipeline. What am I missing?

rici rici · Accepted Answer · 2014-02-07T22:04:55

As with many things, it's just a historical accident. It certainly would have been possible for the token declarations to have been produced by lex (but see below). Or it would have been possible to force the user to write their own declarations.

It is more convenient for yacc/bison to produce the token numberings, though, because:

The terminals need to be parsed by yacc because they are explicit elements in the grammar productions. In lex, on the other hand, they are part of the unparsed actions and lex can generate code without any explicit knowledge about token values; and
yacc (and bison) produce parse tables which are indexed by terminal and non-terminal numbers; the logic of the tables require that terminals and non-terminals have distinct codes. lex has no way of knowing what the non-terminals are, so it can't generate appropriate codes.

The second argument is a bit weak, because in practice bison-generated parsers renumber token ids to fit them into the id-numbering scheme. Even so, this is only possible if bison is in charge of the actual numbers. (The reason for the renumbering is to make the id value contiguous; by another historical accident, it's normal to reserve codes 0 through 255 for single-character tokens, and 0 for EOF; however, not all the 8-bit codes are actually used by most scanners.)

lex/yacc: why do lexers have to include a parser's header file?

3 Answers