I have a problem figuring out how to parse a date in my grammar.

The catch is that a date looks just like a quoted string, but according to the ANTLR 4 documentation, precedence between alternatives should follow the order in which they are declared.

Here is my grammar:

grammar formula;


/* entry point */
parse: expr EOF;

expr
    : value                                  # argumentArithmeticExpr
    | l=expr operator=('*'|'/'|'%') r=expr   # multdivArithmeticExpr // TODO: test the % operator
    | l=expr operator=('+'|'-') r=expr       # addsubtArithmeticExpr
    | '-' expr                               # minusArithmeticExpr
    | FUNCTION_NAME '(' (expr ( ','  expr )* ) ? ')'# functionExpr
    | '(' expr ')'                           # parensArithmeticExpr
    ;

value
    : number
    | variable
    | date
    | string
    | bool;

/* Atoms */

bool
    : BOOL
    ;

variable
    : '[' (~(']') | ' ')* ']'
    ;

date
    : DQUOTE date_format DQUOTE
    | QUOTE date_format QUOTE
    ;

date_format
    : year=INT '-' month=INT '-' day=INT (hour=INT ':' minutes=INT ':' seconds=INT)?
    ;

string
    : STRING_LITERAL
    ;


number
    : ('+'|'-')? NUMERIC_LITERAL
    ;


/* basic lexemes */

QUOTE   : '\'';
DQUOTE  : '"';
MINUS   : '-';
COLON   : ':';
DOT     : '.';
PIPE    : '|';
BOOL    : T R U E | F A L S E;

FUNCTION_NAME: IDENTIFIER ;

IDENTIFIER
 : [a-zA-Z_] [a-zA-Z_0-9]* // TODO: do we need more chars in this set?
 ;

NUMERIC_LITERAL
 : DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )? // ex: 0.05e3
 | '.' DIGIT+ ( E [-+]? DIGIT+ )? // ex: .05e3
 ;

INT: DIGIT+;

STRING_LITERAL
    :  '\'' ( ~'\'' | '\'\'' )* '\''
    |  '"' ( ~'"' | '""' )* '"'
    ;

WS: [ \t\n]+ -> skip;

UNEXPECTED_CHAR: . ;

fragment DIGIT: [0-9];
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');

The important part here is this:

value
    : number
    | variable
    | date
    | string
    | bool;

date
    : DQUOTE date_format DQUOTE
    | QUOTE date_format QUOTE
    ;

date_format
    : year=INT '-' month=INT '-' day=INT (hour=INT ':' minutes=INT ':' seconds=INT)?
    ;

My grammar expects these things:

  • "a quoted string" -> gives a string
  • "2015-03 TOTOTo" -> gives a string because the date format doesn't match.
  • "2015-03-15" -> gives a date because it matches DQUOTE INT '-' INT '-' INT DQUOTE

I tried to make sure that the parser attempts to match a date before it tries a string by ordering the alternatives accordingly: value: ...| date | string| ....

But when I use the grun utility (and my unit tests...), I can see that it categorizes the date as a string, as if it never bothered to check the date format.

[Image: the AST]

Can you tell me why this happens? I suspect there's a catch with the order in which I declare my grammar rules, but I tried some permutations and got nowhere.

1 Answer

The problem stems from overlooking that the lexer tokenizes the input on its own, before any parser rule is considered. The order of the alternatives in a parser rule therefore has no influence on which tokens are produced.

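That means the STRING_LITERAL lexer rule consumes every quoted string, dates included, because the lexer prefers the longest match, and it emits only STRING_LITERAL tokens. The date rule and its subrules never receive the QUOTE, DQUOTE and INT tokens they expect, so the parser never gets a chance to match them.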
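As a rough illustration of why the date never reaches the parser, the sketch below uses java.util.regex as a stand-in for the double-quoted alternative of STRING_LITERAL. It is not ANTLR itself, and the class and method names are made up for this example; it only shows that the string pattern happily covers a whole quoted date, which is why the lexer hands the parser a single STRING_LITERAL token.

```java
import java.util.regex.Pattern;

public class LexerFirstDemo {
    // Regex stand-in (an approximation, not ANTLR) for the lexer alternative
    //   '"' ( ~'"' | '""' )* '"'
    static final Pattern STRING_LITERAL = Pattern.compile("\"([^\"]|\"\")*\"");

    // True when the whole input would be covered by the string pattern,
    // i.e. when the lexer would emit one STRING_LITERAL token for it.
    public static boolean lexesAsString(String input) {
        return STRING_LITERAL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(lexesAsString("\"a quoted string\"")); // true
        System.out.println(lexesAsString("\"2015-03-15\""));      // true
    }
}
```

Both inputs are covered in full, so from the lexer's point of view the date is just another string.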

Perhaps the minimal solution is to modify the STRING_LITERAL lexer rule to

STRING_LITERAL
    :  { notDateString() }? 
    ( QUOTE  .*? QUOTE
    | DQUOTE .*? DQUOTE
    )
    ;

The notDateString predicate has to be written in native (target-language) code, for example in a @lexer::members block, and performs the essential disambiguation between date formats and other strings.
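A minimal sketch of what such a predicate could look like in Java follows. It is an assumption, not the answerer's code: the class name, the notDateStringAt helper, and the exact date pattern (digit runs separated by '-' and ':', with spaces, tabs or newlines before the time part) are all hypothetical. The lookahead is passed in as a function so the logic can be tried without the ANTLR runtime; inside a generated lexer it would presumably be called as notDateString(i -> _input.LA(i)).

```java
import java.util.function.IntUnaryOperator;
import java.util.regex.Pattern;

public class DateStringCheck {
    // Assumed shape of a date body, loosely mirroring the date_format rule.
    private static final Pattern DATE_BODY =
        Pattern.compile("\\d+-\\d+-\\d+([ \\t\\n]+\\d+:\\d+:\\d+)?");

    // la.applyAsInt(i) returns the i-th upcoming character (1-based),
    // or -1 at end of input, in the manner of ANTLR's IntStream.LA(i).
    public static boolean notDateString(IntUnaryOperator la) {
        int quote = la.applyAsInt(1);
        if (quote != '\'' && quote != '"') {
            return true; // not at a quote, nothing to veto
        }
        StringBuilder body = new StringBuilder();
        for (int i = 2; ; i++) {
            int c = la.applyAsInt(i);
            if (c == -1) {
                return true; // unterminated text is not a date
            }
            if (c == quote) {
                break; // reached the closing quote
            }
            body.append((char) c);
        }
        return !DATE_BODY.matcher(body).matches();
    }

    // Hypothetical adapter for trying the check on plain strings.
    public static boolean notDateStringAt(String s) {
        return notDateString(i -> i <= s.length() ? s.charAt(i - 1) : -1);
    }

    public static void main(String[] args) {
        System.out.println(notDateStringAt("\"a quoted string\"")); // true
        System.out.println(notDateStringAt("\"2015-03 TOTOTo\""));  // true
        System.out.println(notDateStringAt("\"2015-03-15\""));      // false
    }
}
```

Even with the predicate in place, the grammar as posted may need one more change: NUMERIC_LITERAL is declared before INT and also matches a plain run of digits, so it would likely win that tie and INT tokens would never be produced for date_format.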

Another alternative is to promote the STRING_LITERAL rule entirely to the parser. That is doable, but a bit messy if whitespace within 'real' strings has to be preserved, since the WS rule currently skips it.

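BTW, you may wish to add a token stream dump to your standard series of unit tests (grun's -tokens option prints one). It would have shown right away that the dates arrive as STRING_LITERAL tokens.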