1
votes

I'm starting to explore ANTLR and I'm trying to match this format: (test123 A0020 )

Where:

  • test123 is an identifier of at most 10 characters (letters and digits)
  • A is the time indicator (AM or PM): a single letter, either "A" or "P"
  • 0020 is the time in a 4-digit format

I tried this grammar:

IDENTIFIER
    : ( LETTER | DIGIT )+
    ;

INT
    : DIGIT+
    ;

fragment DIGIT
    : [0-9]
    ;

fragment LETTER
    : [A-Z]
    ;

WS : [ \t\r\n(\s)+]+ -> channel(HIDDEN) ;

formatter
    : '(' information ')'
    ;

information
    : information '/' 'A' INT
    | IDENTIFIER
    ;

How can I resolve the ambiguity and get the time format matched as 'A' INT rather than as IDENTIFIER? Also, how can I add checks such as a maximum token length to the identifier? I know that this doesn't work in ANTLR: IDENTIFIER : (DIGIT | LETTER ) {2,10}

UPDATE:

I changed the rules to use semantic checks, but I still have the same ambiguity between the identifier and the time format. Here are the modified rules:

formatter
    : information
    | information '-' time
    ;

time :
    timeMode timeCode;  

timeMode:   
    { getCurrentToken().getText().matches("[A,C]")}? MOD
;

timeCode: {getCurrentToken().getText().matches("[0-9]{4}")}?  INT;

information: {getCurrentToken().getText().length() <= 10 }? IDENTIFIER;

MOD:  'A' | 'C';

The problem shows up in the parse tree: A0023 is matched as timeMode, and the parser complains that timeCode is missing.

3
Check this question. Although you would have to convert your lexer rules to parser rules. The naive way is to write IDENTIFIER: (LETTER | DIGIT) (LETTER | DIGIT) ... ten times. - Mephy
Why not tokenize A0023 as a single TIME token? - Bart Kiers
@BartKiers Because I want to include actions in the semantic rules later on without having to treat 'A0023' as a string (I would have to do extra operations to separate the timeMode and timeCode). I actually have the same problem in another parser for distance unit recognition (format [M]\d{3} for distance in meters or [F]\d{4} for feet). - ps_messenger
I'm assuming the following inputs are all identifiers: P123, P12345, P. Correct? - Bart Kiers
Correct. 1P23 and 12PP23 are also identifiers. - ps_messenger

3 Answers

1
votes

Here is a way to handle it:

grammar Test;

@lexer::members {
  private boolean isAhead(int maxAmountOfCharacters, String pattern) {
    final Interval ahead = new Interval(this._tokenStartCharIndex, this._tokenStartCharIndex + maxAmountOfCharacters - 1);
    return this._input.getText(ahead).matches(pattern);
  }
}

parse
 : formatter EOF
 ;

formatter
 : information ( '-' time )?
 ;

time
 : timeMode timeCode
 ;

timeMode
 : TIME_MODE
 ;

timeCode
 : {getCurrentToken().getType() == IDENTIFIER_OR_INTEGER && getCurrentToken().getText().matches("\\d{4}")}?
   IDENTIFIER_OR_INTEGER
 ;

information
 : {getCurrentToken().getType() == IDENTIFIER_OR_INTEGER && getCurrentToken().getText().matches("\\w*[a-zA-Z]\\w*")}?
   IDENTIFIER_OR_INTEGER
 ;

IDENTIFIER_OR_INTEGER
 : {!isAhead(6, "[AP]\\d{4}(\\D|$)")}? [a-zA-Z0-9]+
 ;

TIME_MODE
 : [AP]
 ;

SPACES
 : [ \t\r\n] -> skip
 ;

A small test class:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class Main {

    private static void indent(String lispTree) {

        int indentation = -1;

        for (final char c : lispTree.toCharArray()) {
            if (c == '(') {
                indentation++;
                for (int i = 0; i < indentation; i++) {
                    System.out.print(i == 0 ? "\n  " : "  ");
                }
            }
            else if (c == ')') {
                indentation--;
            }
            System.out.print(c);
        }
    }

    public static void main(String[] args) throws Exception {
        TestLexer lexer = new TestLexer(new ANTLRInputStream("1P23 - A0023"));
        TestParser parser = new TestParser(new CommonTokenStream(lexer));
        indent(parser.parse().toStringTree(parser));
    }
}

will print:

(parse 
  (formatter 
    (information 1P23) - 
    (time 
      (timeMode A) 
      (timeCode 0023))) <EOF>)

for the input "1P23 - A0023".
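
To see what the lexer predicate on IDENTIFIER_OR_INTEGER decides at a given position, the same check can be run on plain strings outside ANTLR. This is only a sketch: the class name LookaheadSketch and the String-based isAhead signature are made up for illustration and stand in for the lexer's _input and _tokenStartCharIndex.

```java
class LookaheadSketch {

    // Same regex as in the IDENTIFIER_OR_INTEGER predicate of the grammar.
    static final String TIME_AHEAD = "[AP]\\d{4}(\\D|$)";

    // Hypothetical stand-in for the lexer's isAhead(...): inspects at most
    // maxChars characters of input, starting at index start.
    static boolean isAhead(String input, int start, int maxChars, String pattern) {
        int end = Math.min(input.length(), start + maxChars);
        return input.substring(start, end).matches(pattern);
    }

    public static void main(String[] args) {
        String[] samples = { "A0023", "A0023 ", "P123", "P12345", "1P23 - A0023" };
        for (String s : samples) {
            // true means the predicate blocks IDENTIFIER_OR_INTEGER at this position,
            // which leaves the single-letter TIME_MODE rule to match.
            System.out.println("\"" + s + "\" -> " + isAhead(s, 0, 6, TIME_AHEAD));
        }
    }
}
```

Here "A0023" and "A0023 " give true, so the leading letter is left for TIME_MODE, while "P123", "P12345" and the start of "1P23 - A0023" give false and are lexed as IDENTIFIER_OR_INTEGER.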

EDIT

ANTLR can also display the parse tree in a UI component. If you do this instead:

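import java.util.Arrays;

import org.antlr.v4.gui.TreeViewer;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;

public class Main {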

    public static void main(String[] args) throws Exception {
        TestLexer lexer = new TestLexer(new ANTLRInputStream("1P23 - A0023"));
        TestParser parser = new TestParser(new CommonTokenStream(lexer));
        new TreeViewer(Arrays.asList(TestParser.ruleNames), parser.parse()).open();
    }
}

a dialog showing the parse tree will appear.

Tested with ANTLR version 4.5.2-1

0
votes

Using semantic predicates (check this amazing QA), you can define parser rules for your specific model, with logic checks that decide whether the input can be parsed. Note that the predicates below sit in parser rules, not lexer rules.

information
    : information '/' meridien time
    | text
    ;
meridien
    : am
    | pm
    ;
am: {_input.LT(1).getText().equals("A")}? IDENTIFIER;
pm: {_input.LT(1).getText().equals("P")}? IDENTIFIER;
time: {_input.LT(1).getText().length() == 4}? INT;
text: {_input.LT(1).getText().length() <= 10}? IDENTIFIER;
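
The conditions inside these predicates are ordinary Java boolean expressions on the token text, so they can be tried on plain strings first. A minimal sketch, assuming the Java target; the class and method names are made up for illustration and are not part of ANTLR. In Java, string content has to be compared with equals(), because == compares object references.

```java
class PredicateChecks {

    // Mirrors the am/pm predicates: the token text is exactly "A" or "P".
    static boolean isMeridiem(String text) {
        return text.equals("A") || text.equals("P");
    }

    // Mirrors the time predicate: the INT token must be 4 characters long.
    static boolean isTime(String text) {
        return text.length() == 4;
    }

    // Mirrors the text predicate: the IDENTIFIER is at most 10 characters long.
    static boolean isText(String text) {
        return text.length() <= 10;
    }

    public static void main(String[] args) {
        String fromToken = new String("A");  // a distinct String object, as token text would be
        System.out.println(fromToken == "A");         // false: different references
        System.out.println(isMeridiem(fromToken));    // true: same content
        System.out.println(isTime("0020"));           // true
        System.out.println(isText("test123"));        // true
        System.out.println(isText("abcdefghijk"));    // false: 11 characters
    }
}
```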

0
votes

compileUnit
    :   alfaNum time
    ;

alfaNum : (ALFA | MOD | NUM)+;
time : MOD NUM+;

MOD:  'A' | 'P';
ALFA: [a-zA-Z];
NUM:  [0-9];

WS
    :   ' ' -> channel(HIDDEN)
    ;

You can avoid the ambiguity by including MOD in the alfaNum rule.
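
The reason MOD has to appear in alfaNum is that every token here is a single character, and when several lexer rules match the same length ANTLR picks the one defined first. An 'A' or 'P' is therefore always a MOD token, never ALFA, even inside an identifier. The plain-Java sketch below imitates that classification; the class CharTokenSketch and its method are assumptions for illustration, not ANTLR API.

```java
import java.util.ArrayList;
import java.util.List;

class CharTokenSketch {

    // Classifies each character the way the three single-character lexer rules would.
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            if (c == ' ') {
                continue;                       // WS goes to the hidden channel
            }
            if (c == 'A' || c == 'P') {
                types.add("MOD");               // defined before ALFA, so it wins the tie
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                types.add("ALFA");
            } else if (c >= '0' && c <= '9') {
                types.add("NUM");
            } else {
                types.add("ERROR");             // no rule matches this character
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // prints [NUM, MOD, NUM, NUM, MOD, NUM, NUM, NUM, NUM]
        System.out.println(tokenTypes("1P23 A0023"));
    }
}
```

The P in 1P23 comes out as MOD, so an alfaNum rule without MOD would reject that identifier.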