Flex/Bison: lexing ambiguous tokens

Question

I'm dealing with a tricky problem in my flex/bison lexer/parser.

Here are some flex rules, for roman numerals and arbitrary identifiers:

"I"|"II"|"III"|"IV"|"V"|"VI"|"VII"|"i"|"ii"|"iii"|"iv"|"v"|"vi"|"vii" { return NUMERAL; }

"foobar" { return FOOBAR; }

[A-Za-z0-9_]+ { return IDENTIFIER; }

Now, consider this simple grammar:

%token <numeral> NUMERAL
%token <foobar> FOOBAR
%token <identifier> IDENTIFIER

program
  : NUMERAL FOOBAR { }
  ;

Finally, here is an example input:

IVfoobar

I intend for this to lex as the numeral IV, followed by a FOOBAR. However, how can I prevent this from lexing as the numeral I followed by the identifier "Vfoobar", or just identifier "IVfoobar", which are both invalid?

Why is IVfoobar an invalid identifier? Or to put it another way, what exactly is a valid identifier? — rici
@rici There is no parsing rule for it, so it results in a parse error. — dylhunn
Well, yes. But the lexer cannot know that. That makes it a valid identifier used incorrectly. — rici
Why don't you rely on spaces to separate tokens like everybody else? — user207421

Quentin · Accepted Answer · 2017-08-19T09:28:57

If you really want to handle this at the lexer level, you have to make sure the rule for IDENTIFIER doesn't match strings that start with a roman numeral (I, II, ..., vii).

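That's because Lex selects the rule that matches the longest input. For IVfoobar, the IDENTIFIER rule matches all eight characters while the NUMERAL rule matches only two, so IDENTIFIER wins. When two rules match the same length, the one listed first wins, which is why foobar on its own still lexes as FOOBAR.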

Maybe excluding roman numeral letters from the first character of an IDENTIFIER would leave a satisfactory set of valid identifiers?

(?i:[a-z0-9_]{-}[ivxlcdm])(?i:[a-z0-9_])* { return IDENTIFIER; }
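To see what the longest-match rule does to IVfoobar, and what changes once the first character of an identifier is restricted, here is a rough simulation in Python. It is only a sketch, not flex itself: Python's `re` stands in for the flex matcher, a negative lookahead stands in for the `{-}` character class difference, and the names `tokenize`, `ORIGINAL_ID` and `RESTRICTED_ID` are made up for this illustration.

```python
import re

# Alternatives are ordered so that Python's leftmost-first alternation
# still finds the longest numeral (flex does this by itself).
NUMERAL = r"III|II|IV|I|VII|VI|V|iii|ii|iv|i|vii|vi|v"

# The identifier rule from the question.
ORIGINAL_ID = r"[A-Za-z0-9_]+"

# Assumed equivalent of the restricted rule: the first character must not
# be one of i, v, x, l, c, d, m (either case).
RESTRICTED_ID = r"(?![ivxlcdmIVXLCDM])[A-Za-z0-9_]+"


def tokenize(text, identifier_pattern):
    """Longest match wins; on a tie, the rule listed first wins."""
    rules = [
        ("NUMERAL", NUMERAL),
        ("FOOBAR", "foobar"),
        ("IDENTIFIER", identifier_pattern),
    ]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("IVfoobar", ORIGINAL_ID))
# [('IDENTIFIER', 'IVfoobar')]
print(tokenize("IVfoobar", RESTRICTED_ID))
# [('NUMERAL', 'IV'), ('FOOBAR', 'foobar')]
```

Note the cost of the restriction: under the same simulation, `tokenize("index", RESTRICTED_ID)` yields `[('NUMERAL', 'i'), ('IDENTIFIER', 'ndex')]`, so any identifier beginning with one of those letters is no longer lexed as a single token.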
