ANTLR: Lexer rule catching what is supposed to be handled by parser rule

Question

In my grammar I have these lexer rules:

DECIMAL_NUMBER: DIGITS? DOT_SYMBOL DIGITS;

// Identifiers might start with a digit, even though it is discouraged.
IDENTIFIER: LETTER_WHEN_UNQUOTED+;

fragment LETTER_WHEN_UNQUOTED:
    '0'..'9'
    | 'A'..'Z' // Only upper case, as we use a case insensitive parser (insensitive only for ASCII).
    | '$'
    | '_'
    | '\u0080'..'\uffff'
;

WHITESPACE: ( ' ' | '\t' | '\f' | '\r'| '\n') { $channel = HIDDEN; };

and this parser rule:

qualified_identifier: IDENTIFIER '.' IDENTIFIER;

This works nicely except for one special case, like this:

... a.0b

The problem here is that .0 is captured by the DECIMAL_NUMBER rule, but I'd need to ignore it if there are non-digit chars directly following any digits. How can this be done?

I was thinking about a validating predicate, but that would completely break parsing if the DECIMAL_NUMBER rule does not match. Another thought I had was to add an action that checks the characters following what has been matched so far and then generates tokens manually, which seems very ugly.

Is it possible to mark the position after the dot and return to it in the input stream when my action code determines this is not a decimal number?

Here is the ugly solution with manual token generation: github.com/ibre5041/plsql-parser/blob/master/parsers/…. It parses the input 1..5 as NUMBER INTERVAL NUMBER. Another option would be to call "set_type(IDENTIFIER)" inside the DECIMAL_NUMBER rule's action if the digits are followed by other characters. — ibre5041
But that would make e.g. .0a an identifier, which is wrong. The correct way is to generate a DOT IDENTIFIER token sequence instead. — Mike Lischke

Mike Lischke · Accepted Answer · 2015-11-16T14:49:18

The DECIMAL_NUMBER rule must be extended to only match if we have a pure decimal number:

DECIMAL_NUMBER:
    DIGITS DOT_SYMBOL DIGITS
    | DOT_SYMBOL {if (!isAllDigits(ctx)) {FAILEDFLAG = ANTLR3_TRUE; return; }} DIGITS
;

I had to use the same code that semantic predicates implicitly generate when backtracking is active. A predicate alone did not do the job, however, because the backtracking flag it needs is not set in this situation.

The function to check the input is this:

  ANTLR3_BOOLEAN isAllDigits(pMySQLLexer ctx)
  {
    int i = 1;
    while (1)
    {
      int input = LA(i++);
      if (input == EOF || input == ' ' || input == '\t' || input == '\n' || input == '\r' || input == '\f')
        return ANTLR3_TRUE;

      // Check whether a valid identifier char follows (which would make the entire string an identifier).
      // For the used values look up the IDENTIFIER lexer rule.
      if ((input >= 'A' && input <= 'Z') || input == '$' || input == '_' || (input >= 0x80 && input <= 0xffff))
        return ANTLR3_FALSE;

      // Everything else but digits is considered valid input for a new token.
      if (input < '0' || input > '9')
        return ANTLR3_TRUE;
    }
  }
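
To see which inputs this check accepts, the same decision logic can be exercised outside of ANTLR. The following is a minimal sketch, not part of the original grammar: lookahead() is a hypothetical stand-in for the lexer's LA(i) that reads from a plain C string, and is_all_digits() takes the text that follows the dot. It works on bytes rather than code points, so every byte >= 0x80 counts as an identifier character, and it assumes the input is already upper case, as in the case insensitive setup described in the question. The digit range test uses ||, so any character that is neither a digit nor an identifier character ends the scan.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the lexer's LA(i): returns the i-th character
   (1-based) of text, or -1 once the end of the string is reached. */
static int lookahead(const char *text, size_t length, size_t i)
{
  return (i >= 1 && i <= length) ? (unsigned char)text[i - 1] : -1;
}

/* Returns 1 if the text after the dot may be lexed as the fractional part
   of a decimal number, 0 if an identifier char follows the digits. */
static int is_all_digits(const char *text)
{
  assert(text != NULL);

  size_t length = strlen(text);
  size_t i = 1;
  for (;;)
  {
    int input = lookahead(text, length, i++);

    /* End of input or whitespace terminates the token. */
    if (input == -1 || input == ' ' || input == '\t' || input == '\n' ||
        input == '\r' || input == '\f')
      return 1;

    /* An identifier char (other than a digit) makes this an identifier. */
    if ((input >= 'A' && input <= 'Z') || input == '$' || input == '_' ||
        input >= 0x80)
      return 0;

    /* Anything else that is not a digit starts a new token. */
    if (input < '0' || input > '9')
      return 1;
  }
}
```

With this sketch, is_all_digits("0B") returns 0, so a.0B would be left to the DOT IDENTIFIER path, while is_all_digits("05 ") and is_all_digits("5)") return 1 and are still lexed as decimal numbers.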
