1
votes

In my grammar I have these lexer rules:

DECIMAL_NUMBER: DIGITS? DOT_SYMBOL DIGITS;

// Identifiers might start with a digit, even though it is discouraged.
IDENTIFIER: LETTER_WHEN_UNQUOTED+;

fragment LETTER_WHEN_UNQUOTED:
    '0'..'9'
    | 'A'..'Z' // Only upper case, as we use a case insensitive parser (insensitive only for ASCII).
    | '$'
    | '_'
    | '\u0080'..'\uffff'
;

WHITESPACE: ( ' ' | '\t' | '\f' | '\r'| '\n') { $channel = HIDDEN; };

and this parser rule:

qualified_identifier: IDENTIFIER '.' IDENTIFIER;

This works nicely except for one special case, like this:

... a.0b

The problem here is that .0 is captured by the DECIMAL_NUMBER rule, but I'd need to ignore it if there are non-digit chars directly following any digits. How can this be done?

I was thinking about a validating predicate, but that would completely break parsing if the DECIMAL_NUMBER rule does not match. Another thought I had was to add an action that checks for any char following what has been matched so far and then generates tokens manually, which seems very ugly.

Is it possible to mark the position after the dot and return to it in the input stream when my action code determines this is not a decimal number?

2
Here is the ugly solution with manual token generation: github.com/ibre5041/plsql-parser/blob/master/parsers/…. It parses the input 1..5 as NUMBER INTERVAL NUMBER. Another option would be to call "set_type(IDENTIFIER)" inside the DECIMAL_NUMBER rule's action if the digits are followed by some characters. - ibre5041
But that would make e.g. .0a an identifier, which is wrong. The correct way is to generate a DOT IDENTIFIER token sequence instead. - Mike Lischke

2 Answers

1
votes

The DECIMAL_NUMBER rule must be extended to only match if we have a pure decimal number:

DECIMAL_NUMBER:
    DIGITS DOT_SYMBOL DIGITS
    | DOT_SYMBOL {if (!isAllDigits(ctx)) {FAILEDFLAG = ANTLR3_TRUE; return; }} DIGITS
;

I had to use the same code that semantic predicates generate implicitly when backtracking is active. A plain predicate did not do the job, however, because it relies on the backtracking flag, which is not set in this situation.

The function to check the input is this:

  ANTLR3_BOOLEAN isAllDigits(pMySQLLexer ctx)
  {
    int i = 1;
    while (1)
    {
      int input = LA(i++);
      if (input == EOF || input == ' ' || input == '\t' || input == '\n' || input == '\r' || input == '\f')
        return ANTLR3_TRUE;

      // Check whether a valid identifier char follows (which would turn the entire string into an identifier).
      // For the values used here, look up the IDENTIFIER lexer rule.
      if ((input >= 'A' && input <= 'Z') || input == '$' || input == '_' || (input >= 0x80 && input <= 0xffff))
        return ANTLR3_FALSE;

      // Everything else but digits is considered valid input for a new token.
      // Note: this must be ||, a char can never be both below '0' and above '9'.
      if (input < '0' || input > '9')
        return ANTLR3_TRUE;
    }
  }
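
To try the lookahead logic outside the generated lexer, here is a minimal standalone sketch. It assumes the text after the dot is available as a NUL-terminated string that has already been upper-cased (as the case-insensitive input stream would deliver it). The name digits_only_after_dot is made up for this illustration, and treating every byte >= 0x80 as an identifier char is a byte-level approximation of the '\u0080'..'\uffff' range.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for isAllDigits(): 'rest' is the input that follows
   the dot. Returns true if the digit run is not directly followed by an
   identifier character, i.e. a DECIMAL_NUMBER may be matched. */
static bool digits_only_after_dot(const char *rest)
{
    size_t i;
    for (i = 0; rest[i] != '\0'; ++i) {
        unsigned char c = (unsigned char)rest[i];

        /* Whitespace ends the token: everything so far was digits. */
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f')
            return true;

        /* An identifier char after the digits: not a decimal number.
           Bytes >= 0x80 approximate the non-ASCII identifier range. */
        if ((c >= 'A' && c <= 'Z') || c == '$' || c == '_' || c >= 0x80)
            return false;

        /* Any other non-digit starts a new token. */
        if (c < '0' || c > '9')
            return true;
    }
    return true; /* End of input reached with digits only. */
}
```

With this, "0" and "5+1" are accepted as the tail of a decimal number, while "0B" is rejected so that the lexer can fall back to a dot followed by an identifier.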
0
votes

If memory serves, the lexer is greedy (i.e. it looks for the longest token that will match at any given point in the input stream). In a tie, order matters. I'm pretty sure your only solution is to make the dotted identifier a lexer rule and then break up the token post-parse (in my grammar, that's how I handled IDs).
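
Breaking up such a token afterwards could look roughly like this. It is only a sketch: it assumes the token text is available as a NUL-terminated string, and qualified_split_pos is a made-up helper name, not part of any ANTLR runtime.

```c
#include <string.h>

/* Hypothetical helper: returns the index of the separating dot in the text
   of a single dotted-identifier token, or -1 if the text does not have the
   form <part>.<part> with both parts non-empty and exactly one dot. */
static int qualified_split_pos(const char *text)
{
    const char *dot = strchr(text, '.');

    /* No dot, or nothing before or after it. */
    if (dot == NULL || dot == text || dot[1] == '\0')
        return -1;

    /* More than one dot is not handled by this sketch. */
    if (strchr(dot + 1, '.') != NULL)
        return -1;

    return (int)(dot - text);
}
```

The caller would then treat the text before the returned index as the first identifier and the text after it as the second.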

Looking at what you've specified, since an IDENTIFIER can start with a digit (and is only required to be one or more characters), I believe you have a lexer ambiguity: is 1.2 a dotted IDENTIFIER or a DECIMAL_NUMBER? You'll probably need to make the IDENTIFIER token specify two alternatives: one that begins with one or more digits but is required to have at least one non-digit character, and another that allows for one or more non-digit characters. (Maybe you've handled that in the real grammar and this is just simplified for the question.)
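
The "at least one non-digit character" requirement can be stated as a small check. The sketch below assumes upper-cased input and approximates the non-ASCII range by bytes >= 0x80. Both function names are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the LETTER_WHEN_UNQUOTED fragment from the question
   (byte-level approximation for the non-ASCII range). */
static bool is_identifier_char(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
        || c == '$' || c == '_' || c >= 0x80;
}

/* Hypothetical check: true only for a non-empty run of identifier chars
   that contains at least one non-digit, so "123" stays a number. */
static bool is_identifier_text(const char *text)
{
    bool has_non_digit = false;
    size_t i;

    if (text[0] == '\0')
        return false;

    for (i = 0; text[i] != '\0'; ++i) {
        unsigned char c = (unsigned char)text[i];
        if (!is_identifier_char(c))
            return false;
        if (c < '0' || c > '9')
            has_non_digit = true;
    }
    return has_non_digit;
}
```

Under this rule "0B" counts as an identifier while "123" does not, which removes the overlap between a purely numeric IDENTIFIER and the digits of a DECIMAL_NUMBER.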