
I'm having some trouble with a semantic predicate on an ANTLR parser rule. Here's my grammar, intended to recognize a couple of different date formats:

grammar sample ;

options { language=Python3; }

@parser::header {
from datetime import datetime
}

month_number returns [val] : INTEGER { 1    <= int($INTEGER.text) <= 12   }?  {$val = int($INTEGER.text)} ;
day_number   returns [val] : INTEGER { 1    <= int($INTEGER.text) <= 31   }?  {$val = int($INTEGER.text)} ;
year_4digit  returns [val] : INTEGER { 1900 <= int($INTEGER.text) <= 2100 }?  {$val = int($INTEGER.text)} ;

year_2digit  returns [val] : '\''? INTEGER {(int($INTEGER.text) >= 65 or int($INTEGER.text) < 40)}?
                                     {$val = (1900 + int($INTEGER.text)) if (int($INTEGER.text) >= 65) else (2000 + int($INTEGER.text))} ;

year_digits  returns [val]
  : year_4digit {$val = $year_4digit.val}
  | year_2digit {$val = $year_2digit.val}
  ;


mdy returns [val]
  : month_number '-' day_number '-' year_digits  {$val = datetime($year_digits.val, $month_number.val, $day_number.val)}
  | month_number '/' day_number '/' year_digits  {$val = datetime($year_digits.val, $month_number.val, $day_number.val)}
  ;

ymd returns [val]
  : year_4digit '-' month_number '-' day_number  {$val = datetime($year_4digit.val, $month_number.val, $day_number.val)}
  | year_4digit '/' month_number '/' day_number  {$val = datetime($year_4digit.val, $month_number.val, $day_number.val)}
  ;

date_as_numbers returns [val]
  : ymd {$val = $ymd.val}
  | mdy {$val = $mdy.val}
  ;

INTEGER: '0'..'9'+ ;

I test that with the following program:

from myPackage.sampleParser import sampleParser
from myPackage.sampleLexer import sampleLexer

from antlr4 import CommonTokenStream
from antlr4 import InputStream

date_input = InputStream("2/12/2017".lower())
lexer = sampleLexer(date_input)
stream = CommonTokenStream(lexer)
parser = sampleParser(stream)
result = parser.date_as_numbers()
print(result.val)

This results in the following error:

line 1:1 rule year_4digit failed predicate: { 1900 <= int($INTEGER.text) <= 2100 }?
line 1:9 rule day_number failed predicate: { 1    <= int($INTEGER.text) <= 31   }?
Traceback (most recent call last):
  File "/Users/kwilliams/Library/Preferences/IntelliJIdea2017.3/scratches/scratch_1.py", line 11, in <module>
    result = parser.date_as_numbers()
  File "/Users/kwilliams/git/myPackage/sampleParser.py", line 482, in date_as_numbers
    localctx._ymd = self.ymd()
  File "/Users/kwilliams/git/myPackage/sampleParser.py", line 436, in ymd
    localctx.val = datetime(localctx._year_4digit.val, localctx._month_number.val, localctx._day_number.val)
TypeError: an integer is required (got type NoneType)

So what I believe is happening is that the predicate in year_4digit throws an exception because the number 2 isn't in its range, but it returns a year_4digit match anyway, which hasn't had its val attribute populated, causing a downstream error about NoneType. Is that correct?

If so - what's a good solution? Do I need to put the semantic predicates earlier in the rules or something? How would I do a lookahead to the INTEGER token if that's the right solution?

(Also - I expected to be able to do $INTEGER.int instead of int($INTEGER.text), but maybe that's not available in the Python target? Tangential and minor issue.)

BTW, the above grammar is a smallish excerpt from my real grammar. I'm hoping there's a solution that doesn't require major changes to this part, since those could cause ripple effects that might take a while to sort out.

Thanks.

Comment: I fixed my example; I was mistakenly calling ymd directly instead of the date_as_numbers rule. (Ken Williams)

1 Answer


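Apparently the predicates are nested too deeply: they sit inside sub-rules, after the INTEGER token has already been matched, so the parser does not consult them when choosing an alternative and never falls back to the second one: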

date_as_numbers returns [val]
  : ymd {$val = $ymd.val} // alternative 1
  | mdy {$val = $mdy.val} // alternative 2
  ;

When I swap the alternatives:

date_as_numbers returns [val]
  : mdy {$val = $mdy.val}
  | ymd {$val = $ymd.val}
  ;

the input "2/12/2017" is parsed correctly, but then "2017/12/2" fails.

I don't know whether this is expected behaviour or a bug (I haven't done much with the new v4 predicates yet). You could raise an issue about this.

After playing around a bit, I got something working by merging the rules into one big any_date rule and letting its alternatives start with the predicate, rather than having a predicate somewhere in the middle (as you yourself already hinted at):

grammar sample;

@parser::members {
  boolean lte(Token token, int value) {
    return Integer.parseInt(token.getText()) <= value;
  }
  boolean gte(Token token, int value) {
    return Integer.parseInt(token.getText()) >= value;
  }
}

date_as_numbers returns [String val]
  : any_date EOF {$val = $any_date.val;}
  ;

any_date returns [String val]
 : {gte(_input.LT(1), 1) && lte(_input.LT(1), 12)}?
   INTEGER '-' day_number '-' year_digits {$val = "y=" + $year_digits.val + ", m=" + $INTEGER.text + ", d=" + $day_number.val;}
 | {gte(_input.LT(1), 1) && lte(_input.LT(1), 12)}?
   INTEGER '/' day_number '/' year_digits {$val = "y=" + $year_digits.val + ", m=" + $INTEGER.text + ", d=" + $day_number.val;}
 | {gte(_input.LT(1), 1900) && lte(_input.LT(1), 2100)}?
   INTEGER '-' month_number '-' day_number {$val = "y=" + $INTEGER.text + ", m=" + $month_number.val + ", d=" + $day_number.val;}
 | {gte(_input.LT(1), 1900) && lte(_input.LT(1), 2100)}?
   INTEGER '/' month_number '/' day_number {$val = "y=" + $INTEGER.text + ", m=" + $month_number.val + ", d=" + $day_number.val;}
 ;

month_number returns [int val]
 : INTEGER {gte($INTEGER, 1) && lte($INTEGER, 12)}?
   {$val = Integer.parseInt($INTEGER.text);}
 ;

day_number returns [int val]
 : INTEGER {gte($INTEGER, 1) && lte($INTEGER, 31)}?
   {$val = Integer.parseInt($INTEGER.text);}
 ;

year_4digit returns [int val]
 : INTEGER {gte($INTEGER, 1900) && lte($INTEGER, 2100)}?
   {$val = Integer.parseInt($INTEGER.text);}
 ;

year_2digit returns [int val]
 : '\''? INTEGER {gte($INTEGER, 65) || lte($INTEGER, 39)}?
   {$val = Integer.parseInt($INTEGER.text) >= 65 ? 1900 + Integer.parseInt($INTEGER.text) : 2000 + Integer.parseInt($INTEGER.text);}
 ;

year_digits  returns [int val]
  : year_4digit {$val = $year_4digit.val;}
  | year_2digit {$val = $year_2digit.val;}
  ;

INTEGER: '0'..'9'+ ;

(Sorry, no Python.)
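For the Python target, the lte/gte helpers could probably be collapsed into a single range check. The following is only a minimal sketch and has not been run against a generated parser: `in_range` and the stand-in `FakeToken` class are hypothetical names, and it assumes the token handed to the predicate exposes its text as `.text`, as the `$INTEGER.text` references in the question suggest. It also assumes the function is defined at module level, for example through the `@parser::header` section the question already uses.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a parser token; only .text is used."""
    text: str


def in_range(token, low, high):
    """Return True if the token's text is an integer within [low, high]."""
    if token is None or not token.text.isdecimal():
        return False
    return low <= int(token.text) <= high


# Assumed usage inside the grammar (Python target), placed at the start
# of an alternative so the check runs before INTEGER is consumed:
#   {in_range(self._input.LT(1), 1, 12)}? INTEGER '-' day_number ...
print(in_range(FakeToken("2"), 1, 12))     # True
print(in_range(FakeToken("2017"), 1, 12))  # False
```

Returning False for a missing or non-numeric token (such as the end-of-file token) keeps the predicate from raising while the parser is still deciding between alternatives.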

When running this class:

import org.antlr.v4.runtime.*;

public class Main {

  public static void main(String[] args) {

    String[] tests = { "2/12/2017", "2017/12/31", "1-2-'03" };

    for (String test : tests) {
      sampleLexer lexer = new sampleLexer(CharStreams.fromString(test));
      sampleParser parser = new sampleParser(new CommonTokenStream(lexer));
      System.out.println(test + " -> " + parser.date_as_numbers().val);
    }
  }
}

the following is printed:

2/12/2017 -> y=2017, m=2, d=12
2017/12/31 -> y=2017, m=12, d=31
1-2-'03 -> y=2003, m=1, d=2
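The last line shows the two-digit pivot from year_2digit at work: values of 65 and up land in the 1900s, values below 40 in the 2000s. As a rough illustration outside the grammar, the same mapping could be written in plain Python like this. The function name `expand_two_digit_year` is hypothetical, and like the grammar action it does not check that the input really has only two digits.

```python
def expand_two_digit_year(text):
    """Map a two-digit year to four digits using the 65/40 pivot."""
    value = int(text)
    if value >= 65:
        return 1900 + value
    if value < 40:
        return 2000 + value
    # 40..64 is rejected by the year_2digit predicate in the grammar.
    raise ValueError(f"ambiguous two-digit year: {text!r}")


print(expand_two_digit_year("03"))  # 2003
print(expand_two_digit_year("65"))  # 1965
```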

I know, not perfect, but perhaps you can tweak your current grammar a bit and get something working.

EDIT

Of course, you could also ditch the predicates and do something like this instead:

grammar sample;

date_as_numbers
 : ymd
 | mdy
 | failure
 ;

ymd
 : year '/' month '/' day
 | year '-' month '-' day
 ;

mdy
 : month '/' day '/' year
 | month '-' day '-' year
 ;

year
 : '\''? year_2digits
 | NUM_4DIGITS
 ;

year_2digits
 : NUM_1_12
 | NUM_13_31
 | NUM_2DIGITS
 ;

month
 : NUM_1_12
 ;

day
 : NUM_1_12
 | NUM_13_31
 ;

failure
 : NUM_OTHER
 ;

NUM_1_12
 : [1-9]     // 1..9
 | '1' [0-2] // 10..12
 ;

NUM_13_31
 : '1' [3-9] // 13..19
 | '2' D     // 20..29
 | '3' [01]  // 30..31
 ;

NUM_2DIGITS
 : D D
 ;

NUM_4DIGITS
 : '19' D D // 1900..1999
 | '20' D D // 2000..2099
 | '2100'   // 2100
 ;

NUM_OTHER
 : D+
 ;

fragment D : [0-9];
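To see which token type a given number ends up with under these lexer rules, here is a small sketch in plain Python. It is an approximation, not the ANTLR lexer itself: it assumes the lexer always consumes the whole run of digits (NUM_OTHER can match any of them) and then picks the first rule in the grammar that matches that entire run. The names `TOKEN_RULES` and `classify` are made up for this illustration.

```python
import re

# Token rules in the same order as the grammar; ANTLR prefers the longest
# match and, on a tie, the rule listed first.
TOKEN_RULES = [
    ("NUM_1_12", r"[1-9]|1[0-2]"),
    ("NUM_13_31", r"1[3-9]|2[0-9]|3[01]"),
    ("NUM_2DIGITS", r"[0-9]{2}"),
    ("NUM_4DIGITS", r"19[0-9]{2}|20[0-9]{2}|2100"),
    ("NUM_OTHER", r"[0-9]+"),
]


def classify(digits):
    """Return the token type assigned to a whole run of digits."""
    for name, pattern in TOKEN_RULES:
        if re.fullmatch(pattern, digits):
            return name
    raise ValueError(f"not a digit run: {digits!r}")


for sample in ["2", "12", "31", "03", "2017", "2101"]:
    print(sample, classify(sample))
```

Because a number like 12 only ever becomes NUM_1_12, the parser rules day and year_2digits have to list every token type that is valid for them, which is why NUM_1_12 shows up in several rules.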