I am just trying to learn flex, and here is some sample flex code that detects identifiers and digits. I want to improve it so that it also identifies malformed identifiers and numbers (for example: 1var, 12.2.2, 5. etc.). How can I detect these? What changes do I have to make to the code?

My sample code is given below:

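%{
#include <stdio.h>
%}

%option noyywrap

ID       [a-zA-Z][a-zA-Z0-9]*
DIGIT    [0-9]

%%
[ \t]+   { /* skip spaces and tabs */ }
{ID}     { printf("\nIdentifier found"); }
{DIGIT}  { printf("\nDigit found"); }
.        { /* ignore any other character */ }
%%

int main(int argc, char *argv[]) {
    yylex();
    return 0;
}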

1 Answer

This is not a trivial question, because which errors are detected in the lexer is very much part of the overall design of a language processing system, and depends on the syntactic and lexical structure of the language. Some elements that may, on first inspection, seem like lexical errors may turn out not to be. It really depends on the nature of the language; for example, in Fortran, spaces have no meaning, and there is the famous example:

        DO 10 I = 1.10

Is this the keyword DO, the label 10, the identifier I, the operator = and the number 1.10? Actually, no: because spaces are ignored, it starts with the identifier DO10I, followed by = and the number 1.10; whereas

        DO 10 I = 1,10

does have the keyword DO...

So when you see a sequence such as 123abc, you cannot automatically assume it is just an invalid identifier. Sometimes it is better to return it as the two valid tokens NUMBER and IDENTIFIER and leave it to the parser to report any errors that result. The areas where this approach needs care are exponents in floating point constants and integer ranges. An example of an exponent would be:

-1234.457E+12

This has a letter embedded in a number, and would need to be returned as some kind of NUMBER token. Similarly, the overloading of the sign operators causes problems for lexical error detection. The previous number has two signs, - and +. If they are recognised as part of the number, when do the symbols - and + get recognised as the SUBTRACT and ADD tokens? Take for example this expression:

i=i-1;

Is this IDENTIFIER, EQUALS, IDENTIFIER, NUMBER? No, of course not. So we cannot always assume that -1 is just a NUMBER.
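One common way around the sign problem is to keep numbers unsigned in the lexer and always return the signs as their own tokens, leaving it to the parser to decide whether a sign is unary or binary. The following is only a minimal hand-written sketch of that idea in C, not flex output; the token names and the function scan_tokens are hypothetical and chosen to match the token names used above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds, named after the tokens used in the text. */
enum token {
    T_IDENTIFIER, T_NUMBER, T_EQUALS, T_SUBTRACT, T_ADD, T_SEMICOLON, T_UNKNOWN
};

/* Scans src and stores at most max token kinds in out.
   Returns the number of tokens stored.
   Numbers are unsigned here: '-' and '+' are always separate tokens,
   so the parser decides whether a sign is unary or binary. */
size_t scan_tokens(const char *src, enum token *out, size_t max)
{
    size_t n = 0;

    while (*src != '\0' && n < max) {
        unsigned char c = (unsigned char)*src;

        if (isspace(c)) {
            src++;                       /* skip whitespace */
        } else if (isalpha(c)) {
            while (isalnum((unsigned char)*src))
                src++;                   /* letter followed by letters/digits */
            out[n++] = T_IDENTIFIER;
        } else if (isdigit(c)) {
            while (isdigit((unsigned char)*src))
                src++;                   /* unsigned run of digits */
            out[n++] = T_NUMBER;
        } else {
            switch (c) {
            case '=': out[n++] = T_EQUALS;    break;
            case '-': out[n++] = T_SUBTRACT;  break;
            case '+': out[n++] = T_ADD;       break;
            case ';': out[n++] = T_SEMICOLON; break;
            default:  out[n++] = T_UNKNOWN;   break;
            }
            src++;
        }
    }
    return n;
}
```

With this sketch, i=i-1; comes out as IDENTIFIER, EQUALS, IDENTIFIER, SUBTRACT, NUMBER, SEMICOLON, and 123abc comes out as NUMBER followed by IDENTIFIER, which the parser can then reject if the grammar does not allow it.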

The integer range, mentioned earlier, which in many languages (Pascal in particular) is represented as 1..8, using two dots to separate the lower and upper bounds, causes difficulties when handling floating point expressions like 1.2: after reading 1. the lexer cannot yet tell whether a fraction or a second dot follows.
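A hand-written scanner usually resolves this with one character of lookahead: the dot is only taken as part of the number when a digit follows it. The sketch below illustrates that rule in C under simplifying assumptions (no signs, no exponents); the function name scan_number is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the numeric lexeme at the start of s (0 if none).
   Sets *is_float to 1 when a fraction part was consumed, else to 0.
   Simplifying assumptions: no sign and no exponent are handled. */
size_t scan_number(const char *s, int *is_float)
{
    size_t i = 0;

    *is_float = 0;
    while (isdigit((unsigned char)s[i]))
        i++;
    if (i == 0)
        return 0;                        /* does not start with a digit */

    /* Take the '.' only when a digit follows it, so "1..8" stops after "1". */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i]))
            i++;
        *is_float = 1;
    }
    return i;
}
```

Under this rule 1..8 yields the integer 1 and leaves both dots for a range token, 1.2 is read as one floating point number, 12.2.2 is split after 12.2, and 5. yields the integer 5 followed by a separate dot.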

So the question, "How do I check for ill-formed identifiers and numbers in a lexer?" is quite loaded, and may suggest that the asker has not fully absorbed the subject area. Questions like this are often posed in class tests, as they are a good way for the instructor to see whether the student possesses deeper knowledge of language processing, or just answers in a superficial way and attempts to write patterns for such objects.

As just mentioned, the naive answer would be just to write regular expression patterns to match the examples of invalid lexemes.

For example, I could add the patterns:

[0-9]+\.[0-9]+(\.[0-9]+)+    {printf("Bad float: %s\n", yytext);}
[0-9]+[a-zA-Z][a-zA-Z0-9]*   {printf("Bad Identifier: %s\n", yytext);}

But most compilers do not do this. The only lexical errors most compilers detect are unclosed strings and comments. This is also why most languages do not allow newlines in strings: an unclosed string can then be detected easily at the end of the line.
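To illustrate why forbidding newlines makes this easy, here is a minimal sketch in C of a string scanner that gives up at the end of the line. It assumes double-quoted strings with C-style backslash escapes, which is an assumption for the example rather than something the question specifies; the names scan_string and string_status are hypothetical.

```c
#include <stddef.h>

enum string_status { STRING_OK, STRING_UNTERMINATED, STRING_NOT_A_STRING };

/* s should point at an opening double quote.
   On return, *len holds the number of characters consumed.
   Assumption: C-style backslash escapes, and no raw newline inside a string. */
enum string_status scan_string(const char *s, size_t *len)
{
    size_t i = 1;

    *len = 0;
    if (s[0] != '"')
        return STRING_NOT_A_STRING;

    while (s[i] != '\0' && s[i] != '\n') {
        if (s[i] == '\\' && s[i + 1] != '\0' && s[i + 1] != '\n') {
            i += 2;                      /* skip the escaped character */
            continue;
        }
        if (s[i] == '"') {
            *len = i + 1;                /* include the closing quote */
            return STRING_OK;
        }
        i++;
    }

    /* Reached end of line or end of input without a closing quote. */
    *len = i;
    return STRING_UNTERMINATED;
}
```

Because the scan never goes past the newline, the error can be reported on the line where the string started, and lexing can resume on the next line instead of swallowing the rest of the file.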