Accumulating / Collecting Errors via ErrorListener to handle after the Parse

Question

The ErrorListener mechanism in ANTLR 4 is great for logging and making decisions about syntax errors as they occur during a parse, but it could be better suited to batch error handling after the parse is finished. There are a number of reasons you might want to handle errors after the parse finishes, including:

We need a clean way to programmatically check for errors during a parse and handle them after the fact.
Sometimes one syntax error causes several others (when not recovered in line, for instance), so it can be helpful to group or nest these errors by parent context when displaying output to the user, and you can't know all the errors until the parse is finished.
You may want to display errors differently depending on how many there are and how severe they are. For example, a single error that exited a rule, or a few errors that were all recovered in line, might only require asking the user to fix those local areas; otherwise, you might have the user edit the entire input. You need all the errors to make this determination.
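
To make the second and third reasons concrete, here is a minimal sketch of grouping collected errors by their enclosing rule and picking a reporting mode once the full list is known. It deliberately uses a hypothetical stand-in type (CollectedError) rather than ANTLR classes. The names ErrorTriage, Mode, groupByRule, and chooseMode, as well as the policy itself, are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one collected syntax error; not an ANTLR type.
final class CollectedError {
    final String ruleName;          // rule the error occurred in
    final boolean recoveredInline;  // true if repaired without leaving the rule
    final String message;

    CollectedError(String ruleName, boolean recoveredInline, String message) {
        this.ruleName = ruleName;
        this.recoveredInline = recoveredInline;
        this.message = message;
    }
}

final class ErrorTriage {
    enum Mode { NONE, FIX_LOCALLY, EDIT_WHOLE_INPUT }

    // Group errors by enclosing rule, preserving first-seen order.
    static Map<String, List<CollectedError>> groupByRule(List<CollectedError> errors) {
        Map<String, List<CollectedError>> groups =
                new LinkedHashMap<String, List<CollectedError>>();
        for (CollectedError e : errors) {
            List<CollectedError> bucket = groups.get(e.ruleName);
            if (bucket == null) {
                bucket = new ArrayList<CollectedError>();
                groups.put(e.ruleName, bucket);
            }
            bucket.add(e);
        }
        return groups;
    }

    // Illustrative policy only: fix locally if there is a single error or if
    // every error was recovered in line; otherwise edit the whole input.
    static Mode chooseMode(List<CollectedError> errors) {
        if (errors.isEmpty()) {
            return Mode.NONE;
        }
        if (errors.size() == 1) {
            return Mode.FIX_LOCALLY;
        }
        for (CollectedError e : errors) {
            if (!e.recoveredInline) {
                return Mode.EDIT_WHOLE_INPUT;
            }
        }
        return Mode.FIX_LOCALLY;
    }
}
```

With real data, ruleName and recoveredInline would be filled from whatever the error listener captured. The thresholds in chooseMode are a matter of taste.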

The bottom line is that we can be smarter about reporting and asking users to fix syntax errors if we know the full context in which the errors occurred (including other errors). To do this, I have the following three goals:

1. a full collection of all the errors from a given parse,
2. context information for each error, and
3. severity and recovery information for each error.

I have written code to do #1 and #2, and I'm looking for help on #3. I'm also going to suggest some small changes to make #1 and #2 easier for everyone.

First, to accomplish #1 (a full collection of errors), I created CollectionErrorListener as follows:

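import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.antlr.v4.runtime.Token;

public class CollectionErrorListener extends BaseErrorListener {

    private final List<SyntaxError> errors = new ArrayList<SyntaxError>();

    public List<SyntaxError> getErrors() {
        return errors;
    }

    // Attach this listener to a Parser only; the casts below assume that.
    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
        if (e == null) {
            // e is null when the parser was able to recover in line without exiting the surrounding rule.
            Parser parser = (Parser) recognizer;
            e = new InlineRecognitionException(msg, parser, parser.getInputStream(), parser.getContext(), (Token) offendingSymbol);
        }
        this.errors.add(new SyntaxError(msg, e));
    }
}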

And here is my class for InlineRecognitionException:

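import org.antlr.v4.runtime.IntStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.antlr.v4.runtime.Token;

public class InlineRecognitionException extends RecognitionException {

    public InlineRecognitionException(String message, Recognizer<?, ?> recognizer, IntStream input, ParserRuleContext ctx, Token offendingToken) {
        super(message, recognizer, input, ctx);
        this.setOffendingToken(offendingToken);
    }
}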

And here is my class for the SyntaxError container:

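import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.RecognitionException;

public class SyntaxError extends RecognitionException {

    public SyntaxError(String message, RecognitionException e) {
        super(message, e.getRecognizer(), e.getInputStream(), (ParserRuleContext) e.getCtx());
        this.setOffendingToken(e.getOffendingToken());
        // Keep the original exception reachable through getCause().
        this.initCause(e);
    }
}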

This is very similar to the SyntaxErrorListener referred to by 280Z28's answer to Antlr error/exception handling. I need both the InlineRecognitionException and the SyntaxError wrapper because of how the parameters of CollectionErrorListener.syntaxError are filled.

First of all, the RecognitionException parameter "e" is null if the parser recovered from the exception in line (without leaving the rule). We can't just instantiate a new RecognitionException because there is no constructor or method that allows us to set the offending token. Anyway, being able to differentiate errors that were recovered in line (using an instanceof test) is useful information for achieving goal #3, so we can use the class of InlineRecognitionException to indicate in line recovery.

Next, we need the SyntaxError wrapper class because, even when RecognitionException "e" is not null (i.e., when recovery was not in line), the value of e.getMessage() is null (for some unknown reason). We therefore need to store the msg parameter passed to CollectionErrorListener.syntaxError. Because there is no setMessage() setter on RecognitionException, and we can't just instantiate a new RecognitionException (we would lose the offending token information, as discussed in the previous paragraph), we are left with subclassing to be able to set the message, offending token, and cause appropriately.

And this mechanism works really well:

    CollectionErrorListener collector = new CollectionErrorListener();
    parser.addErrorListener(collector);
    ParseTree tree = parser.prog();

    //  ...  Later ...
    for (SyntaxError e : collector.getErrors()) {
        // RecognitionExceptionUtil is my custom class discussed next.
        System.out.println(RecognitionExceptionUtil.formatVerbose(e));
    }

This gets to my next point. Formatting output from a RecognitionException is kinda annoying. Chapter 9 of The Definitive ANTLR 4 Reference book shows how displaying quality error messages means you need to split the input lines, reverse the rule invocation stack, and piece together a lot of stuff from the offending token to explain where the error occurred. And, the following command doesn't work if you are reporting errors after the parse is finished:

// The following doesn't work if you are not reporting during the parse because the
// parser context is lost from the RecognitionException "e" recognizer.
List<String> stack = ((Parser)e.getRecognizer()).getRuleInvocationStack();

The problem is that we have lost the RuleContext, and that is needed for getRuleInvocationStack. Luckily, RecognitionException keeps a copy of our context and getRuleInvocationStack takes a parameter, so here is how we get the rule invocation stack after the parse is finished:

// Pass in the context from RecognitionException "e" to get the rule invocation stack
// after the parse is finished.
List<String> stack = ((Parser)e.getRecognizer()).getRuleInvocationStack(e.getCtx());
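
Passing the context works after the parse because each rule context keeps a reference to its parent, so the stack can be rebuilt by walking those links. The no-argument form starts from the parser's current context, which is no longer available once the parse has returned. The following is a simplified model of that walk, not ANTLR's source; Ctx and RuleStacks are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a rule context node; not the ANTLR class.
final class Ctx {
    final Ctx parent;     // null for the start rule
    final int ruleIndex;  // index into the rule-name table

    Ctx(Ctx parent, int ruleIndex) {
        this.parent = parent;
        this.ruleIndex = ruleIndex;
    }
}

final class RuleStacks {
    // Walk from the given context up to the root, innermost rule first.
    static List<String> invocationStack(Ctx ctx, String[] ruleNames) {
        List<String> stack = new ArrayList<String>();
        for (Ctx p = ctx; p != null; p = p.parent) {
            stack.add(p.ruleIndex < 0 ? "n/a" : ruleNames[p.ruleIndex]);
        }
        return stack;
    }
}
```

The list comes back innermost rule first, so reverse it if you want to print the stack starting from the start rule.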

In general, it would be especially nice if we had some convenience methods in RecognitionException to make error reporting more friendly. Here is my first attempt at a utility class of methods that could be part of RecognitionException:

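import java.util.Collections;
import java.util.List;

import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.RecognitionException;

public class RecognitionExceptionUtil {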

    public static String formatVerbose(RecognitionException e) {
        return String.format("ERROR on line %s:%s => %s%nrule stack: %s%noffending token %s => %s%n%s",
                getLineNumberString(e),
                getCharPositionInLineString(e),
                e.getMessage(),
                getRuleStackString(e),
                getOffendingTokenString(e),
                getOffendingTokenVerboseString(e),
                getErrorLineStringUnderlined(e).replaceAll("(?m)^|$", "|"));
    }

    public static String getRuleStackString(RecognitionException e) {
        if (e == null || e.getRecognizer() == null
                || e.getCtx() == null
                || e.getRecognizer().getRuleNames() == null) {
            return "";
        }
        List<String> stack = ((Parser)e.getRecognizer()).getRuleInvocationStack(e.getCtx());
        Collections.reverse(stack);
        return stack.toString();
    }

    public static String getLineNumberString(RecognitionException e) {
        if (e == null || e.getOffendingToken() == null) {
            return "";
        }
        return String.format("%d", e.getOffendingToken().getLine());
    }

    public static String getCharPositionInLineString(RecognitionException e) {
        if (e == null || e.getOffendingToken() == null) {
            return "";
        }
        return String.format("%d", e.getOffendingToken().getCharPositionInLine());
    }

    public static String getOffendingTokenString(RecognitionException e) {
        if (e == null || e.getOffendingToken() == null) {
            return "";
        }
        return e.getOffendingToken().toString();
    }

    public static String getOffendingTokenVerboseString(RecognitionException e) {
        if (e == null || e.getOffendingToken() == null) {
            return "";
        }
        int type = e.getOffendingToken().getType();
        String[] tokenNames = e.getRecognizer() == null ? null : e.getRecognizer().getTokenNames();
        // EOF has token type -1, so guard the lookup instead of indexing blindly.
        String typeName = (tokenNames != null && type >= 0 && type < tokenNames.length)
                ? tokenNames[type]
                : (type == -1 ? "EOF" : "<unknown>");
        return String.format("at tokenStream[%d], inputString[%d..%d] = '%s', tokenType<%d> = %s, on line %d, character %d",
                e.getOffendingToken().getTokenIndex(),
                e.getOffendingToken().getStartIndex(),
                e.getOffendingToken().getStopIndex(),
                e.getOffendingToken().getText(),
                type,
                typeName,
                e.getOffendingToken().getLine(),
                e.getOffendingToken().getCharPositionInLine());
    }

    public static String getErrorLineString(RecognitionException e) {
        if (e == null || e.getRecognizer() == null
                || e.getRecognizer().getInputStream() == null
                || e.getOffendingToken() == null) {
            return "";
        }
        CommonTokenStream tokens =
            (CommonTokenStream)e.getRecognizer().getInputStream();
        String input = tokens.getTokenSource().getInputStream().toString();
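        String[] lines = input.split("\r?\n");
        int index = e.getOffendingToken().getLine() - 1;
        // split() drops trailing empty lines, so an error reported at EOF after a
        // final newline can point one line past the end of the array.
        return (index >= 0 && index < lines.length) ? lines[index] : "";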
    }

    public static String getErrorLineStringUnderlined(RecognitionException e) {
        String errorLine = getErrorLineString(e);
        if (errorLine.isEmpty()) {
            return errorLine;
        }
        // replace tabs with single space so that charPositionInLine gives us the
        // column to start underlining.
        errorLine = errorLine.replaceAll("\t", " ");
        StringBuilder underLine = new StringBuilder(String.format("%" + errorLine.length() + "s", ""));
        int start = e.getOffendingToken().getStartIndex();
        int stop = e.getOffendingToken().getStopIndex();
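        if ( start>=0 && stop>=0 ) {
            int column = e.getOffendingToken().getCharPositionInLine();
            // Stay within the line so a token that spans lines cannot overflow the buffer.
            for (int i=0; i<=(stop-start) && column>=0 && column+i<underLine.length(); i++) {
                underLine.setCharAt(column + i, '^');
            }
        }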
        return String.format("%s%n%s", errorLine, underLine);
    }
}

My RecognitionExceptionUtil leaves a lot to be desired (always returning strings, not checking that recognizer is of type Parser, not handling multiple lines in getErrorLineString, etc.), but I'm hoping you get the idea.
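
One way to chip away at those limitations is to keep the underlining logic independent of RecognitionException, so that it takes plain values, clamps them to the line, and can be tested without a parser. This is only a sketch under that assumption. Underliner and underline are hypothetical names, and the caller is assumed to pass the token's getCharPositionInLine() as the column and stopIndex - startIndex + 1 as the length.

```java
final class Underliner {
    // Returns the line followed by a row of '^' under [column, column + length),
    // clamped to the line so an out-of-range or multi-line token cannot throw.
    static String underline(String line, int column, int length) {
        // Replace tabs with single spaces so each character occupies one column.
        String shown = line.replace('\t', ' ');
        int start = Math.max(0, Math.min(column, shown.length()));
        int end = Math.max(start, Math.min(column + length, shown.length()));
        StringBuilder marks = new StringBuilder();
        for (int i = 0; i < shown.length(); i++) {
            marks.append(i >= start && i < end ? '^' : ' ');
        }
        return shown + System.lineSeparator() + marks;
    }
}
```

A token that runs past the end of the line is simply underlined up to the last character of that line; showing the remaining lines of a multi-line token is left out of this sketch.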

SUMMARY of my suggestions for a future version of ANTLR:

1. Always populate the "RecognitionException e" parameter of ANTLRErrorListener.syntaxError (including the OffendingToken) so that we can collect these exceptions for batch handling after the parse. While you're at it, make sure e.getMessage() returns the value currently in the msg parameter.
2. Add a constructor for RecognitionException that includes OffendingToken.
3. Remove the other parameters in the method signature of ANTLRErrorListener.syntaxError since they will be extraneous and lead to confusion.
4. Add convenience methods in RecognitionException for common stuff such as getCharPositionInLine, getLineNumber, getRuleStack, and the rest of the methods from my RecognitionExceptionUtil class defined above. Of course, these will have to check for null, and some of them will also have to check that recognizer is of type Parser.
5. When calling ANTLRErrorListener.syntaxError, clone the recognizer so that we don't lose the context when the parse finishes (and we can more easily call getRuleInvocationStack).
6. If you clone the recognizer, you won't need to store the context in RecognitionException. We can make two changes to e.getCtx(): first, rename it to e.getContext() to make it consistent with Parser.getContext(), and second, make it a convenience method for the recognizer we already have in RecognitionException (checking that recognizer is an instance of Parser).
7. Include information in RecognitionException about the severity of the error and how the parser recovered. This is my goal #3 from the beginning. It would be great to categorize syntax errors by how well the parser handled them. Did this error blow up the entire parse or just show up as a blip in line? How many and which tokens were skipped / inserted?

So, I'm looking for feedback on my three goals and especially any suggestions for gathering more information about goal #3: severity and recovery information for each error.

Jonathan D'Andries · Accepted Answer · 2014-01-07T14:48:54

I posted these suggestions to the ANTLR 4 GitHub issue list and received the reply below. I believe that the ANTLRErrorListener.syntaxError method contains redundant / confusing parameters and requires a lot of API knowledge to use properly, but I understand the decision. Here is the link to the issue and a copy of the response:

From: https://github.com/antlr/antlr4/issues/396

Regarding your suggestions:

1. Populating the RecognitionException e argument to syntaxError: As mentioned in the documentation:

The RecognitionException is non-null for all syntax errors except when we discover mismatched token errors that we can recover from in-line, without returning from the surrounding rule (via the single token insertion and deletion mechanism).

2. Adding a constructor to RecognitionException with the offending token: This is not really relevant to this issue, and would be addressed separately (if at all).
3. Removing parameters from syntaxError: This would not only introduce breaking changes for users who have implemented this method in previous releases of ANTLR 4, but it would eliminate the ability to report the available information for errors which occurred inline (i.e. errors where no RecognitionException is available).
4. Convenience methods in RecognitionException: This is not really relevant to this issue, and would be addressed separately (if at all). (Further note: It's hard enough as-is to document the API. This just adds more ways to do things that are already readily accessible, so I oppose this change.)
5. Cloning the recognizer when calling syntaxError: This is a performance-critical method, so new objects are only created when absolutely necessary.
6. "If cloning the recognizer": The recognizer will never be cloned before calling syntaxError.
7. This information can be stored in an associative map in your implementation of ANTLRErrorListener and/or ANTLRErrorStrategy if necessary for your application.

I'm closing this issue for now since I don't see any action items requiring changes to the runtime from this list.
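
The associative-map idea in the last point of the reply could look roughly like the following sketch: a side table, keyed by the error object, that records how each error was recovered. RecoveryLog, Kind, and the severity rule are all hypothetical, and ANTLR defines none of them. Where record() gets called from (for example, a custom ANTLRErrorStrategy) is left as an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class RecoveryLog {
    // Hypothetical categories; ANTLR does not define this enum.
    enum Kind { TOKEN_DELETED, TOKEN_INSERTED, RULE_EXITED }

    // Identity-keyed so two distinct error objects never share an entry,
    // regardless of how equals()/hashCode() behave on the key type.
    private final Map<Object, List<Kind>> byError =
            new IdentityHashMap<Object, List<Kind>>();

    // Record one recovery action taken for the given error object.
    void record(Object error, Kind kind) {
        List<Kind> kinds = byError.get(error);
        if (kinds == null) {
            kinds = new ArrayList<Kind>();
            byError.put(error, kinds);
        }
        kinds.add(kind);
    }

    // Read-only view of the recovery actions recorded for an error.
    List<Kind> recoveriesFor(Object error) {
        List<Kind> kinds = byError.get(error);
        return kinds == null
                ? Collections.<Kind>emptyList()
                : Collections.unmodifiableList(kinds);
    }

    // Illustrative ranking only: leaving the rule is treated as more severe
    // than any in-line repair.
    boolean isSevere(Object error) {
        return recoveriesFor(error).contains(Kind.RULE_EXITED);
    }
}
```

After the parse, the collected errors can be looked up in the log to decide how prominently each one should be reported.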
