Antlr4 discards remaining tokens instead of bailing out

Question

I am using Antlr4, and here is a simplified grammar I wrote:

grammar BooleanExpression;

/*******************************
 *      Parser Rules
 *******************************/
booleanTerm
    : booleanLiteral (KW_OR booleanLiteral)+
    | booleanLiteral
    ;

id
    : IDENTIFIER
    ;

booleanLiteral
    : KW_TRUE
    | KW_FALSE
    ;

/*******************************
 *         Lexer Rules
 *******************************/
KW_TRUE
    : 'true'
    ;

KW_FALSE
    : 'false'
    ;

KW_OR
    : 'or'
    ;   

IDENTIFIER
    : (SIMPLE_LATIN)+
    ;

fragment 
SIMPLE_LATIN
    : 'A' .. 'Z'
    | 'a' .. 'z'
    ;

WHITESPACE
    : [ \t\n\r]+ -> skip
    ;

I use a custom BailErrorStrategy and BailLexer, shown below:

public class BailErrorStrategy extends DefaultErrorStrategy {
    /**
     * Instead of recovering from exception e, rethrow it wrapped in a generic
     * IllegalArgumentException so it is not caught by the rule function catches.
     * Exception e is the "cause" of the IllegalArgumentException.
     */

    @Override
    public void recover(Parser recognizer, RecognitionException e) {
        throw new IllegalArgumentException(e);
    }

    /**
     * Make sure we don't attempt to recover inline; if the parser successfully
     * recovers, it won't throw an exception.
     */
    @Override
    public Token recoverInline(Parser recognizer) throws RecognitionException {
        throw new IllegalArgumentException(new InputMismatchException(recognizer));
    }

    /** Make sure we don't attempt to recover from problems in subrules. */
    @Override
    public void sync(Parser recognizer) {
    }

    @Override
    protected Token getMissingSymbol(Parser recognizer) {
        throw new IllegalArgumentException(new InputMismatchException(recognizer));
    }
}



public class BailLexer extends BooleanExpressionLexer {
    public BailLexer(CharStream input) {
        super(input);
        //removeErrorListeners();
        //addErrorListener(new ConsoleErrorListener());
    }

    @Override
    public void recover(LexerNoViableAltException e) {
        throw new IllegalArgumentException(e); // Bail out
    }

    @Override
    public void recover(RecognitionException re) {
        throw new IllegalArgumentException(re); // Bail out
    }
}

Everything works okay except one case. I tried the following expression:

true OR false

I expected this expression to be rejected with an IllegalArgumentException, because the 'or' keyword must be lower case. Instead, Antlr4 accepted it. The input is tokenized as "KW_TRUE IDENTIFIER KW_FALSE", which is expected, since upper-case 'OR' matches IDENTIFIER. However, the parser did not report an error for this token stream: it produced a tree containing only "true" and discarded the remaining "IDENTIFIER KW_FALSE" tokens.

I tried different prediction modes, but all of them behaved the same way. While debugging, I eventually arrived at this piece of code in Antlr:

ATNConfigSet reach = computeReachSet(previous, t, false);

if ( reach==null ) {
    // if any configs in previous dipped into outer context, that
    // means that input up to t actually finished entry rule
    // at least for SLL decision. Full LL doesn't dip into outer
    // so don't need special case.
    // We will get an error no matter what so delay until after
    // decision; better error message. Also, no reachable target
    // ATN states in SLL implies LL will also get nowhere.
    // If conflict in states that dip out, choose min since we
    // will get error no matter what.
    int alt = getAltThatFinishedDecisionEntryRule(previousD.configs);
    if ( alt!=ATN.INVALID_ALT_NUMBER ) {
        // return w/o altering DFA
        return alt;
    }
    throw noViableAlt(input, outerContext, previous, startIndex);
}

The call "int alt = getAltThatFinishedDecisionEntryRule(previousD.configs);" returned the second alternative of booleanTerm (because "true" matches the second alternative, "booleanLiteral"). Since that value is not equal to ATN.INVALID_ALT_NUMBER, noViableAlt is not thrown at this point. According to the comment there, "We will get an error no matter what so delay until after decision", but no error was ever thrown.

I really have no idea how to make Antlr report an error in this case. Could someone shed some light on this? Any help is appreciated, thanks.

Maybe not all tokens are consumed? What happens if you force the parser to parse all the way to the end-of-input: parse : booleanTerm EOF; — Bart Kiers

Sam Harwell · Accepted Answer · 2013-02-28T14:24:22

If your top-level rule does not end with an explicit EOF, then ANTLR is not required to parse to the end of the input sequence. Rather than throw an exception, it simply parsed the valid portion of the sequence you gave it.

The following start rule would force it to parse the entire input sequence as a single booleanTerm.

start : booleanTerm EOF;
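To illustrate why the trailing EOF matters, here is a minimal hand-written sketch in plain Java. It is not ANTLR code and not what the generated parser looks like; the class TinyBooleanParser and its whitespace-based tokenizing are assumptions made only for this illustration. It mimics a booleanTerm rule that stops at the first token it cannot use, next to a start rule that additionally demands end of input.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the generated parser; illustration only. */
final class TinyBooleanParser {
    private final List<String> tokens;
    private int pos = 0;

    TinyBooleanParser(String input) {
        // Naive tokenizer: split on whitespace (assumption for this sketch).
        String trimmed = input.trim();
        this.tokens = trimmed.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(trimmed.split("\\s+"));
    }

    /** booleanTerm : booleanLiteral (KW_OR booleanLiteral)* ; returns the number of tokens consumed. */
    int booleanTerm() {
        booleanLiteral();
        // Stop quietly at the first token that is not 'or'; nothing here looks at what follows.
        while (pos < tokens.size() && tokens.get(pos).equals("or")) {
            pos++;
            booleanLiteral();
        }
        return pos;
    }

    /** start : booleanTerm EOF ; rejects any leftover tokens. */
    int start() {
        booleanTerm();
        if (pos != tokens.size()) {
            throw new IllegalArgumentException(
                    "expected EOF but found '" + tokens.get(pos) + "'");
        }
        return pos;
    }

    private void booleanLiteral() {
        boolean isLiteral = pos < tokens.size()
                && (tokens.get(pos).equals("true") || tokens.get(pos).equals("false"));
        if (!isLiteral) {
            throw new IllegalArgumentException("expected 'true' or 'false' at token " + pos);
        }
        pos++;
    }
}
```

With the input "true OR false", calling booleanTerm() consumes only the first token and returns normally, while start() throws because "OR" is still waiting in the input. That is the same difference the EOF in the start rule makes for the ANTLR-generated parser.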

Also, BailErrorStrategy is already provided by the ANTLR 4 runtime. It throws a ParseCancellationException, which is more informative than the IllegalArgumentException used in your example.
