How to debug ANTLR4 grammar extraneous / mismatched input error

Question

I want to parse the Rulebook "demo.rb" files like below:

rulebook Titanic-Normalization {
  version 1

  meta {
    description "Test"
    source "my-rules.xslx"
    user "joltie"
  }

  rule remove-first-line {
    description "Removes first line when offset is zero"
    when(present(offset) && offset == 0) then {
      filter-row-if-true true;
    }
  }
}

I wrote the ANTLR4 grammar file Rulebook.g4 like below. For now, it can parse the *.rb files generally well, but it throw unexpected error when encounter the "expression" / "statement" rules.

grammar Rulebook;

rulebookStatement
    :   KWRulebook
        (GeneralIdentifier | Identifier)
        '{'
        KWVersion
        VersionConstant
        metaStatement
        (ruleStatement)+
        '}'
    ;

metaStatement
    :   KWMeta
        '{'
        KWDescription
        StringLiteral
        KWSource
        StringLiteral
        KWUser
        StringLiteral
        '}'
    ;

ruleStatement
    :   KWRule
        (GeneralIdentifier | Identifier)
        '{'
        KWDescription
        StringLiteral
        whenThenStatement
        '}'
    ;

whenThenStatement
    :   KWWhen '(' expression ')'
        KWThen '{' statement '}'
    ;

primaryExpression
    :   GeneralIdentifier
    |   Identifier
    |   StringLiteral+
    |   '(' expression ')'
    ;

postfixExpression
    :   primaryExpression
    |   postfixExpression '[' expression ']'
    |   postfixExpression '(' argumentExpressionList? ')'
    |   postfixExpression '.' Identifier
    |   postfixExpression '->' Identifier
    |   postfixExpression '++'
    |   postfixExpression '--'
    ;

argumentExpressionList
    :   assignmentExpression
    |   argumentExpressionList ',' assignmentExpression
    ;

unaryExpression
    :   postfixExpression
    |   '++' unaryExpression
    |   '--' unaryExpression
    |   unaryOperator castExpression
    ;

unaryOperator
    :   '&' | '*' | '+' | '-' | '~' | '!'
    ;

castExpression
    :   unaryExpression
    |   DigitSequence // for
    ;

multiplicativeExpression
    :   castExpression
    |   multiplicativeExpression '*' castExpression
    |   multiplicativeExpression '/' castExpression
    |   multiplicativeExpression '%' castExpression
    ;

additiveExpression
    :   multiplicativeExpression
    |   additiveExpression '+' multiplicativeExpression
    |   additiveExpression '-' multiplicativeExpression
    ;

shiftExpression
    :   additiveExpression
    |   shiftExpression '<<' additiveExpression
    |   shiftExpression '>>' additiveExpression
    ;

relationalExpression
    :   shiftExpression
    |   relationalExpression '<' shiftExpression
    |   relationalExpression '>' shiftExpression
    |   relationalExpression '<=' shiftExpression
    |   relationalExpression '>=' shiftExpression
    ;

equalityExpression
    :   relationalExpression
    |   equalityExpression '==' relationalExpression
    |   equalityExpression '!=' relationalExpression
    ;

andExpression
    :   equalityExpression
    |   andExpression '&' equalityExpression
    ;

exclusiveOrExpression
    :   andExpression
    |   exclusiveOrExpression '^' andExpression
    ;

inclusiveOrExpression
    :   exclusiveOrExpression
    |   inclusiveOrExpression '|' exclusiveOrExpression
    ;

logicalAndExpression
    :   inclusiveOrExpression
    |   logicalAndExpression '&&' inclusiveOrExpression
    ;

logicalOrExpression
    :   logicalAndExpression
    |   logicalOrExpression '||' logicalAndExpression
    ;

conditionalExpression
    :   logicalOrExpression ('?' expression ':' conditionalExpression)?
    ;

assignmentExpression
    :   conditionalExpression
    |   unaryExpression assignmentOperator assignmentExpression
    |   DigitSequence // for
    ;

assignmentOperator
    :   '=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<=' | '>>=' | '&=' | '^=' | '|='
    ;

expression
    :   assignmentExpression
    |   expression ',' assignmentExpression
    ;

statement
    :   expressionStatement
    ;

expressionStatement
    :   expression+ ';'
    ;


KWRulebook: 'rulebook';
KWVersion: 'version';
KWMeta: 'meta';
KWDescription: 'description';
KWSource: 'source';
KWUser: 'user';
KWRule: 'rule';
KWWhen: 'when';
KWThen: 'then';
KWTrue: 'true';
KWFalse: 'false';

fragment
LeftParen : '(';

fragment
RightParen : ')';

fragment
LeftBracket : '[';

fragment
RightBracket : ']';

fragment
LeftBrace : '{';

fragment
RightBrace : '}';


Identifier
    :   IdentifierNondigit
        (   IdentifierNondigit
        |   Digit
        )*
    ;

GeneralIdentifier
    :   Identifier
        ('-' Identifier)+
    ;

fragment
IdentifierNondigit
    :   Nondigit
    //|   // other implementation-defined characters...
    ;

VersionConstant
    :   DigitSequence ('.' DigitSequence)*
    ;

DigitSequence
    :   Digit+
    ;

fragment
Nondigit
    :   [a-zA-Z_]
    ;

fragment
Digit
    :   [0-9]
    ;

StringLiteral
    :   '"' SCharSequence? '"'
    |   '\'' SCharSequence? '\''
    ;

fragment
SCharSequence
    :   SChar+
    ;

fragment
SChar
    :   ~["\\\r\n]
    |   '\\\n'   // Added line
    |   '\\\r\n' // Added line
    ;

Whitespace
    :   [ \t]+
        -> skip
    ;

Newline
    :   (   '\r' '\n'?
        |   '\n'
        )
        -> skip
    ;

BlockComment
    :   '/*' .*? '*/'
        -> skip
    ;

LineComment
    :   '//' ~[\r\n]*
        -> skip
    ;

I tested the Rulebook parser with unit test like below:

    public void testScanRulebookFile() throws IOException {
        String fileName = "C:\\rulebooks\\demo.rb";
        FileInputStream fis = new FileInputStream(fileName);
        // create a CharStream that reads from standard input
        CharStream input = CharStreams.fromStream(fis);

        // create a lexer that feeds off of input CharStream
        RulebookLexer lexer = new RulebookLexer(input);

        // create a buffer of tokens pulled from the lexer
        CommonTokenStream tokens = new CommonTokenStream(lexer);

        // create a parser that feeds off the tokens buffer
        RulebookParser parser = new RulebookParser(tokens);


        RulebookStatementContext context = parser.rulebookStatement();
//        WhenThenStatementContext context = parser.whenThenStatement();

        System.out.println(context.toStringTree(parser));

//      ParseTree tree = parser.getContext(); // begin parsing at init rule
//      System.out.println(tree.toStringTree(parser)); // print LISP-style tree
    }

For the "demo.rb" as above, the parser got the error as below. I also print the RulebookStatementContext as toStringTree.

line 12:25 mismatched input '&&' expecting ')'
(rulebookStatement rulebook Titanic-Normalization { version 1 (metaStatement meta { description "Test" source "my-rules.xslx" user "joltie" }) (ruleStatement rule remove-first-line { description "Removes first line when offset is zero" (whenThenStatement when ( (expression (assignmentExpression (conditionalExpression (logicalOrExpression (logicalAndExpression (inclusiveOrExpression (exclusiveOrExpression (andExpression (equalityExpression (relationalExpression (shiftExpression (additiveExpression (multiplicativeExpression (castExpression (unaryExpression (postfixExpression (postfixExpression (primaryExpression present)) ( (argumentExpressionList (assignmentExpression (conditionalExpression (logicalOrExpression (logicalAndExpression (inclusiveOrExpression (exclusiveOrExpression (andExpression (equalityExpression (relationalExpression (shiftExpression (additiveExpression (multiplicativeExpression (castExpression (unaryExpression (postfixExpression (primaryExpression offset))))))))))))))))) ))))))))))))))))) && offset == 0 ) then { filter-row-if-true true ;) }) })

I also write the unit test to test short input context like "when (offset == 0) then {\n" + "filter-row-if-true true;\n" + "}\n" to debug the problem. But it still got the error like:

line 1:16 mismatched input '0' expecting {'(', '++', '--', '&&', '&', '*', '+', '-', '~', '!', Identifier, GeneralIdentifier, DigitSequence, StringLiteral}
line 2:19 extraneous input 'true' expecting {'(', '++', '--', '&&', '&', '*', '+', '-', '~', '!', ';', Identifier, GeneralIdentifier, DigitSequence, StringLiteral}

With two day's tries, I didn't got any progress. The question is so long as above, please someone give me some advises about how to debug ANTLR4 grammar extraneous / mismatched input error.

Raven Raven · Accepted Answer · 2017-08-10T16:13:45

I don't know if there are any more sophisticated methods to debug a grammar/parser but here's how I usally do it:

Reduce the input that causes the problem to as few characters as possible.
Reduce your grammar as far as possible so that it still produces the same error on the respective input (most of the time that means wrinting a minimal grammar for the reduced input by recycling the rules of the original grammar (simplifying as far as possible)
Make sure the lexer segments the input properly (for that the feature in ANTLRWorks that shows you the lexer output is great)
Have a look at the ParseTree. ANTLR's testRig has a feature that displays the ParseTree graphically (You can access this functionality either through ANTLRWorks or by ANTLR's TreeViewer) so you can have a look where the parser's interpretation differs from the one you have
Do the parsing "by hand". That means you will take your grammar and go through the input by yourself, step by step and try to apply no logic or assumptions/knowledge/etc. during that process. Just follow through your own grammar as a computer would do it. Question every step you take (Is there another way to match the input) and always try to match the input in another way than the one you actually want it to be parsed

Try to fix the error in the minimal grammar and migrate the solution to your real grammar afterwards.

How to debug ANTLR4 grammar extraneous / mismatched input error

3 Answers