How do I resolve the lexical ambiguity between numbers and dates in ANTLR 2?

Question

I have two token types in my lexer defined like this:

NUMBERVALUE
    :    ( '0' .. '9' )+ ( '.' ( '0' .. '9' )+ )?
    ;

DATEVALUE
    :    ( '0' .. '9' ) ( '0' .. '9' ) ( '0' .. '9' ) ( '0' .. '9' ) '-' 
         ( '0' .. '9' ) ( '0' .. '9' ) '-' 
         ( '0' .. '9' ) ( '0' .. '9' )
    |    ( '0' .. '9' ) ( '0' .. '9' ) '-' 
         ( '0' .. '9' ) ( '0' .. '9' ) '-' 
         ( '0' .. '9' ) ( '0' .. '9' )
    ;

I would have thought that, since a date must contain a hyphen within its first five characters, setting k=5 in the lexer options would be enough for the lexer to always tell the two apart. However, I get this warning when I run antlr:

warning:lexical nondeterminism between rules NUMBERVALUE and DATEVALUE upon
    k==1:'0'..'9'
    k==2:'0'..'9'
    k==3:'0'..'9'
    k==4:'0'..'9'
    k==5:'0'..'9'

and the parser doesn't recognise numbers with more than four digits in them. How do I resolve the lexical ambiguity?

MSalters · Accepted Answer · 2009-01-12T11:51:11

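It seems to me that you have run into a spurious warning caused by ANTLR 2's linear approximate lookahead. The 1st, 2nd, 3rd, 4th and 5th characters of DATEVALUE can each be a digit, just not all at the same time. The approximation checks each lookahead depth on its own rather than whole five-character sequences, so it cannot see that.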
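To make the point about linear approximate lookahead concrete, here is a small Python sketch. It is a toy model, not ANTLR's actual analysis: it abstracts every digit to the symbol `d`, pads short tokens with `$`, and ignores whatever characters might follow a token. The names `lookahead_tuples` and `per_depth_sets` are made up for this illustration.

```python
import re
from itertools import product

K = 5             # lookahead depth from the question's lexer options
ALPHABET = "d-."  # 'd' stands for any digit '0'..'9'
MAX_LEN = 10      # longest DATEVALUE; long enough to cover every 5-symbol prefix

# The two lexer rules, rewritten over the abstract alphabet.
NUMBER = re.compile(r"d+(\.d+)?")
DATE = re.compile(r"(dddd|dd)-dd-dd")


def lookahead_tuples(rule, k=K):
    """Exact view: every k-symbol prefix of a word the rule accepts.

    Words shorter than k are padded with '$' (end of token).
    """
    prefixes = set()
    for n in range(1, MAX_LEN + 1):
        for chars in product(ALPHABET, repeat=n):
            word = "".join(chars)
            if rule.fullmatch(word):
                prefixes.add(word[:k].ljust(k, "$"))
    return prefixes


def per_depth_sets(prefixes, k=K):
    """Approximate view: one symbol set per depth, losing which symbols go together."""
    return [{p[i] for p in prefixes} for i in range(k)]


number = lookahead_tuples(NUMBER)
date = lookahead_tuples(DATE)

# Exact comparison: is any whole 5-symbol prefix shared by both rules?
full_conflict = number & date

# Per-depth comparison: do the rules overlap at each depth taken separately?
approx_conflict = [a & b for a, b in zip(per_depth_sets(number), per_depth_sets(date))]

print("shared 5-symbol prefixes:", sorted(full_conflict))
for depth, overlap in enumerate(approx_conflict, start=1):
    print(f"k=={depth}: overlap {sorted(overlap)}")
```

Under these assumptions the exact comparison finds no shared prefix, because every DATEVALUE prefix contains a hyphen, while the per-depth comparison overlaps on a digit at all five depths. That mirrors the shape of the warning in the question.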

I'd try to get rid of the choice between 2-digit and 4-digit years and keep only four-digit years. For starters, you don't want to be responsible for the Y2.1K bug; secondly, it saves you an alternative DATEVALUE syntax. Another solution I'd try is to use a different grouping:

DATEVALUE
    :    ( '0' .. '9' ) ( '0' .. '9' ) ( ( '0' .. '9' ) ( '0' .. '9' ) )? '-'
         ( '0' .. '9' ) ( '0' .. '9' ) '-'
         ( '0' .. '9' ) ( '0' .. '9' )
    ;

I think it's more readable, as you don't repeat the month/day part. For readability I would rather put the optional part first, but I understand that starting a rule with an optional part should be avoided in cases like this, as it makes the parsing harder.
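As a quick sanity check that the regrouped rule accepts the same strings as the original two-alternative rule, both can be transcribed into Python regular expressions and compared. This is only a sketch: the regexes are my transcription of the two grammar rules, the sample strings are arbitrary, and it says nothing about how ANTLR itself will analyse the regrouped rule.

```python
import re

# Original DATEVALUE: four-digit-year alternative | two-digit-year alternative.
ORIGINAL = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}-[0-9]{2}-[0-9]{2}")

# Regrouped DATEVALUE: two digits, optionally two more, then the shared month/day part.
REGROUPED = re.compile(r"[0-9]{2}([0-9]{2})?-[0-9]{2}-[0-9]{2}")

# Arbitrary samples: valid dates, plain numbers, and near misses.
samples = [
    "2009-01-12",
    "09-01-12",
    "12345",
    "3.14",
    "123-01-12",
    "2009-1-12",
    "20090-01-12",
]

# Map each sample to (accepted by original, accepted by regrouped).
results = {
    s: (bool(ORIGINAL.fullmatch(s)), bool(REGROUPED.fullmatch(s)))
    for s in samples
}

for s, (orig_ok, regrouped_ok) in results.items():
    print(f"{s!r}: original={orig_ok} regrouped={regrouped_ok}")
```

The two patterns agree on every sample here, which is what you would expect since the regrouping only factors out the shared month/day suffix.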
