ANTLR4 semantic predicates mess with error recovery. Why?

Question

I'm facing a somewhat strange behavior in error recovery when using semantic predicates.

I need error recovery (specifically the single token insertion) for the texts I'll be parsing have many "single missing token" errors.

I also need semantic predicates because for something like ANTLR4: Matching all input alternatives exaclty once (the second alternative).

But it seems the two don't mix well (I've previously seen this and asked SO for help: ANTLR4 DefaultErrorStrategy fails to inject missing token; then I found an answer; now I don't).

Let the grammar (so simple, it matches any number of "A", separated with spaces, terminated by semi-colon):

grammar AAAGrammar;

WS : ' '+ -> channel(HIDDEN);
A : 'A';
SEMICOLON : ';';


aaaaa : 
    a* ';'
  ;

a :
    A
  ;

Running over the following inputs, the resulting parse tree is:

"A A A;": (aaaaa (a A) (a A) (a A) ;);
"A A A" : (aaaaa (a A) (a A) (a A) <missing ';'>) (this one emits a warning on stderr: line 1:5 missing ';' at '').

That's what I expect, and exactly what I want (the missing semi-colon of the 2nd input was correctly injected).

Now a simple change to grammar, introducing a semantic predicate (this one innocuous, but I understand that ANTLR4 does not - and should not - evaluate this) in the "a" rule, to make it:

a :
    {true}? A
  ;

Running it again over the same inputs: - "A A A;": (aaaaa (a A) (a A) (a A) ;); - "A A A" : (aaaaa (a A) (a A) (a A)) (this one also emits a warning on stderr: line 1:5 no viable alternative at input '').

So semantic predicates totally messed up with missing token injection.

Is this expected?

Why?

Is there any ANTLR4 grammar trick to get error recovery back, without removing the sempred?

Edit: (in reply to @CoronA comment)

Here the diff -u between generated parsers (without and with semantic predicate):

--- withoutsempred.java 2015-05-04 09:39:22.644069398 -0300
+++ withsempred.java    2015-05-04 09:39:13.400046354 -0300
@@ -56,22 +56,24 @@
    public final AaaaaContext aaaaa() throws RecognitionException {
        AaaaaContext _localctx = new AaaaaContext(_ctx, getState());
        enterRule(_localctx, 0, RULE_aaaaa);
-       int _la;
        try {
+           int _alt;
            enterOuterAlt(_localctx, 1);
            {
            setState(7);
            _errHandler.sync(this);
-           _la = _input.LA(1);
-           while (_la==A) {
-               {
-               {
-               setState(4); a();
-               }
+           _alt = getInterpreter().adaptivePredict(_input,0,_ctx);
+           while ( _alt!=2 && _alt!=-1 ) {
+               if ( _alt==1 ) {
+                   {
+                   {
+                   setState(4); a();
+                   }
+                   } 
                }
                setState(9);
                _errHandler.sync(this);
-               _la = _input.LA(1);
+               _alt = getInterpreter().adaptivePredict(_input,0,_ctx);
            }
            setState(10); match(SEMICOLON);
            }
@@ -101,7 +103,9 @@
        try {
            enterOuterAlt(_localctx, 1);
            {
-           setState(12); match(A);
+           setState(12);
+           if (!( true )) throw new FailedPredicateException(this, " true ");
+           setState(13); match(A);
            }
        }
        catch (RecognitionException re) {
@@ -115,12 +119,25 @@
        return _localctx;
    }

+   public boolean sempred(RuleContext _localctx, int ruleIndex, int predIndex) {
+       switch (ruleIndex) {
+       case 1: return a_sempred((AContext)_localctx, predIndex);
+       }
+       return true;
+   }
+   private boolean a_sempred(AContext _localctx, int predIndex) {
+       switch (predIndex) {
+       case 0: return  true ;
+       }
+       return true;
+   }
+
    public static final String _serializedATN =
-       "\3\uacf5\uee8c\u4f5d\u8b0d\u4a45\u78bd\u1b2f\u3378\3\5\21\4\2\t\2\4\3"+
-       "\t\3\3\2\7\2\b\n\2\f\2\16\2\13\13\2\3\2\3\2\3\3\3\3\3\3\2\4\2\4\2\2\17"+
-       "\2\t\3\2\2\2\4\16\3\2\2\2\6\b\5\4\3\2\7\6\3\2\2\2\b\13\3\2\2\2\t\7\3\2"+
-       "\2\2\t\n\3\2\2\2\n\f\3\2\2\2\13\t\3\2\2\2\f\r\7\5\2\2\r\3\3\2\2\2\16\17"+
-       "\7\4\2\2\17\5\3\2\2\2\3\t";
+       "\3\uacf5\uee8c\u4f5d\u8b0d\u4a45\u78bd\u1b2f\u3378\3\5\22\4\2\t\2\4\3"+
+       "\t\3\3\2\7\2\b\n\2\f\2\16\2\13\13\2\3\2\3\2\3\3\3\3\3\3\3\3\2\4\2\4\2"+
+       "\2\20\2\t\3\2\2\2\4\16\3\2\2\2\6\b\5\4\3\2\7\6\3\2\2\2\b\13\3\2\2\2\t"+
+       "\7\3\2\2\2\t\n\3\2\2\2\n\f\3\2\2\2\13\t\3\2\2\2\f\r\7\5\2\2\r\3\3\2\2"+
+       "\2\16\17\6\3\2\2\17\20\7\4\2\2\20\5\3\2\2\2\3\t";
    public static final ATN _ATN =
        ATNSimulator.deserialize(_serializedATN.toCharArray());
    static {

I've debugged both codes.

Supposing an input "A A A" (without semi-colon), the version without semantic predicate goes

            while (_la==A) {
                {
                {
                setState(4); a();
                }
                }
                setState(9);
                _errHandler.sync(this);
                _la = _input.LA(1);
            }

This block 3 times, then proceeds to

            setState(10); match(SEMICOLON);

The match(SEMICOLON) injects a missing token.

Now note that the version with semantic predicate gets rid of _la = _input.LA(1) (lookahead) and switches to a more advanced prediction with _alt = getInterpreter().adaptivePredict(_input,0,_ctx).

With the very same input, the version with semantic predicate goes:

            _alt = getInterpreter().adaptivePredict(_input,0,_ctx);
            while ( _alt!=2 && _alt!=-1 ) {
                if ( _alt==1 ) {
                    {
                    {
                    setState(4); a();
                    }
                    } 
                }
                setState(9);
                _errHandler.sync(this);
                _alt = getInterpreter().adaptivePredict(_input,0,_ctx);
            }

This block 3 times, but it doesn't leave the block unexceptionally. The last _alt = getInterpreter().adaptivePredict(_input,0,_ctx) throws org.antlr.v4.runtime.NoViableAltException, skipping the match(SEMICOLON) entirely.

I think recovery is called but the token injection is not considered because the predicate evaluation cannot be simulated by the recovery. Have you debugged it? Does it enter the recovery in both cases? If so you can rewrite recovery to your needs. — CoronA
@CoronA, I've edited the answer to give some insight on what debugging reveals. The problem is that the match(SEMICOLON) doesn't get a chance to execute, since the adaptativePredict() throws an exception when there is no viable alternative. I think the interpreter is waiting for either "A" or ";" to decide which way to go; but the simpler lookahead(1) strategy doesn't choke on the missing ';' and lets its match() proceed normally (thus injecting the missing token). — rslemos
I've just seen commit ea4676b18ab40c99b629e078f1addd6605988403, which may have fixed this issue. I'll test with 4.5 and come back later to tell. — rslemos
I had the same problem with adaptive predict and error recovery with ANTLR 4.5. And my conclusion is similar to yours. The atn-Prediction breaks the common error recovery. I am not sure if this is a bug or an accepted inconvenience. Perhaps it is worth filing an issue. — CoronA

GRosenberg GRosenberg · Accepted Answer · 2015-05-06T03:52:30

Understand that the DefaultErrorStrategy takes a naive approach to identifying the rule and source of a parsing exceptions.

In particular, evaluating a predicate within the scope of an error recovery routine is sufficiently difficult that it is not done as part of the DefaultErrorStrategy handling.

Consider this variant of your test grammar:

aaaaa   : a* SEMI EOF           ;
a       : ( { true }? B )? A    ;
A   : 'A';
B   : 'B';
SEMI: ';';
WS  : ' '+ -> channel(HIDDEN) ;

On the input AAA, the error message printed is

line 1:5 no viable alternative at input '<EOF>'
([] ([4] A) ([4] A) ([4] A))

Even though the predicated B is optional, there is no simple way to be assured that the predicate is irrelevant. And no simple way to rerun the predicate to evaluate its output within the context of the error recovery operation. The only valid runtime conclusion here then is that the error cannot be identified as existing in just one rule (or sub-rule).

Of course, you can extend the DefaultErrorStrategy to address issues that are specific to your grammar or that are more complex than what the default strategy can handle.

In combination with extending the DefaultErrorStrategy, consider extending the RecognitionException to gain a greater understanding of exactly where and how the originating exception occurred - note the method getExpectedTokens().

As you may appreciate, handling all possible error forms as part of the parsing can become complex. In general, auto-correction within the parser is appropriate where the errors are discrete, well defined and easily identifiable. Otherwise, treat them as semantic errors to be corrected during the analysis phase(s).

ANTLR4 semantic predicates mess with error recovery. Why?

1 Answers