1 vote

I have a scannerless parser grammar that uses the CharsAsTokens faux lexer, and ANTLR4 versions through 4.6 generate a usable Java parser class from it. But after updating to ANTLR 4.7.2 (and up through 4.9.3-SNAPSHOT), the tool generates code that produces dozens of compilation errors from the same grammar file, as detailed below.

My question here is simply: Are scannerless parser grammars no longer supported, or must their character-based terminals be specified differently in 4.7 and beyond?

Update:

Unfortunately, I cannot post my complete grammar here, as it is derived from FOUO security marking guidance, access to which is restricted by the U.S. government (I am a DoD/IC contractor).

The upgrade incompatibility, however, is entirely reproducible with the CSQL.g4 scannerless parser grammar example referred to by Ter in Section 5.6 of The Definitive ANTLR 4 Reference.

Like my grammar, the CSQL example uses CharsAsTokens.java as its tokenizer and CharVocab.tokens as its token vocabulary.
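
For context, the parser is driven roughly like this (a simplified sketch; the Driver class, file handling, and start-rule name are placeholders rather than my actual code):

    import org.antlr.v4.runtime.*;

    public class Driver {
        public static void main(String[] args) throws Exception {
            // CharStreams is the 4.7+ API; older versions used ANTLRFileStream.
            CharStream chars = CharStreams.fromFileName(args[0]);
            // CharsAsTokens (from the book's example code) emits one token per
            // input character, using the character code as the token type.
            TokenSource tokenizer = new CharsAsTokens(chars);
            CommonTokenStream tokens = new CommonTokenStream(tokenizer);
            CSQL parser = new CSQL(tokens);
            parser.file();  // placeholder: substitute the grammar's actual start rule
        }
    }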

Note that every token is specified by its character literal, with its ASCII code as the token type, as in:

'*'=42
'+'=43

and that the parser grammar references the quoted character literals directly within its rules, as in:

star: '*' ws? ;
plus: '+' ws? ;

The issue here is that ANTLR4 versions 4.2 through 4.6 generated compilable parser classes from such grammars, while ANTLR v4.7.2 and later generate Java code with numerous errors.

Here is a snippet from the usable CSQL Java class definition generated by ANTLR v4.6:

 public static class ArgsContext extends ParserRuleContext {
      public List<ArgContext> arg() {
          return getRuleContexts(ArgContext.class);
      }
      public ArgContext arg(int i) {
          return getRuleContext(ArgContext.class,i);
      }
      public ArgsContext(ParserRuleContext parent, int invokingState) {
          super(parent, invokingState);
      }
      @Override public int getRuleIndex() { return RULE_args; }
      @Override
      public void enterRule(ParseTreeListener listener) {
          if ( listener instanceof CSQLListener ) ((CSQLListener)listener).enterArgs(this);
      }
      @Override
      public void exitRule(ParseTreeListener listener) {
          if ( listener instanceof CSQLListener ) ((CSQLListener)listener).exitArgs(this);
      }
 }

And here is the corresponding but now broken code generated by ANTLR v4.7.2:

 public static class ArgsContext extends ParserRuleContext {
      public List<ArgContext> arg() {
          return getRuleContexts(ArgContext.class);
      }
      public ArgContext arg(int i) {
          return getRuleContext(ArgContext.class,i);
      }
      public List<TerminalNode> ','() { return getTokens(CSQL.','); }   // line 446
      public TerminalNode ','(int i) {                                  // line 447
          return getToken(CSQL.',', i);                                 // line 448
      }                                                                 // line 449
      public ArgsContext(ParserRuleContext parent, int invokingState) {
          super(parent, invokingState);
      }
      @Override public int getRuleIndex() { return RULE_args; }
      @Override
      public void enterRule(ParseTreeListener listener) {
          if ( listener instanceof CSQLListener ) ((CSQLListener)listener).enterArgs(this);
      }
      @Override
      public void exitRule(ParseTreeListener listener) {
          if ( listener instanceof CSQLListener ) ((CSQLListener)listener).exitArgs(this);
      }
 }

The commented lines above (446 through 449) are generated only by the newer ANTLR tools (the line-number comments are mine), and when compiled they result in:

Syntax error on token "','", Identifier expected  CSQL.java     /CSQL/generated-sources  line 446  Java Problem
Syntax error on token "','", delete this token    CSQL.java     /CSQL/generated-sources  line 447  Java Problem
CSQL cannot be resolved to a variable   CSQL.java /CSQL/generated-sources     line 448  Java Problem
Syntax error on token ".", , expected   CSQL.java /CSQL/generated-sources     line 448  Java Problem

So why the backwards-incompatible change in ANTLR v4.7+, and what is the best way to work around it?

Could you post a complete grammar that demonstrates your problem? - Bart Kiers
Side note: how come you call that "scannerless"? CharsAsTokens is a simplified token source, but still a token source (and hence, a scanner). Your problem is probably not about support for "scannerless parsers" (whatever that means in this context), but about changes in newer versions of ANTLR that produce syntax errors in the generated code. So please give us examples of the errors and the relevant parts of your grammar. - Mike Lischke
You can't define a string literal in the tokens file and then use that string literal in the parser grammar file. I.e., in your .tokens file: ' '=1 '\n'=2 '\r'=3 ..., and in the parser file: parser grammar ArithmeticParser; options { tokenVocab = ArithmeticLexer; } ws: ( ' ' | '\r' | '\n' )+;. You're going to have to use the token names instead of the literals (ws: (SP | CR | LF);); then you can use 4.9.2. Ideally, you would just have a lexer grammar to declare all this, but not use the lexer. Then my trfoldlit tool from Trash can make the changes to your parser grammar automatically. Or just do it by hand. - kaby76
(Actually, the trfoldlit program can't handle split-grammar literal unfolding yet; that's a bug I'll fix. For now, you'll have to unfold the string literals in your parser grammar manually and declare the token names in the .tokens file.) - kaby76
@Bart Kiers I have updated the problem statement with complete examples of a simpler but similar grammar, along with the identical tokenizer and token vocabulary used by my code. - coder

1 Answer

1 vote

Try defining a GrammarLexer.g4 file instead of the GrammarLexer.tokens file. (You'd still use options { tokenVocab = GrammarLexer; }, just as you do when you create the GrammarLexer.tokens file.) It could be as simple as:

lexer grammar GrammarLexer;

T1 : ' ';
T2 : '\n';
T3 : '\r';
T4 : 'a';
T5 : 'b';

This will create the token names for you. ANTLR will still allow you to write 'a', '\n', etc. in your parser grammar rules, but it will match them up with the lexer rule names in the lexer grammar and use those names (e.g. T4 where you have 'a' in your rules, and T2 where you have '\n'), so the generated code compiles cleanly; see the sketch below. You won't have to use the lexer, as long as your CharsAsTokens produces the same token values. (Though, thinking about it, that lexer would essentially be the equivalent of the CharsAsTokens tokenizer you're using, and running it would guarantee that the token numbers match up.)
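
For example, the parser grammar side might look like this (a sketch; the grammar name and rules here are placeholders, not taken from your grammar):

    parser grammar MyParser;
    options { tokenVocab = GrammarLexer; }

    // ANTLR resolves each literal to the lexer rule that defines it:
    // ' ' -> T1, '\n' -> T2, '\r' -> T3, 'a' -> T4, 'b' -> T5
    ws : (' ' | '\r' | '\n')+ ;
    ab : 'a' 'b' ws? ;

With a symbolic name available for every literal, the generated context classes get accessors like T4() and T2() instead of methods named after the literal itself, which is exactly what made the 4.7.2 output uncompilable.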

It seems this would still achieve your goal of the tokens being just a stream of characters, with everything handled in the parser rules. (And it would not really be any more onerous than generating the *.tokens file; both need to be an exhaustive list of all valid characters.)
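
If you do keep a CharsAsTokens-style token source instead of running the generated lexer, one way to guarantee the numbers line up is to derive the character-to-token-type mapping from the generated lexer's vocabulary rather than hard-coding it. A rough sketch (the class and method names are mine, and the escape handling only covers the simple cases above):

    import java.util.HashMap;
    import java.util.Map;
    import org.antlr.v4.runtime.Vocabulary;

    public class CharTokenTypes {
        // Build a char -> token-type table from the generated lexer's vocabulary,
        // so a character-at-a-time token source emits exactly the type numbers
        // the generated parser expects.
        public static Map<Character, Integer> fromVocabulary(Vocabulary vocab) {
            Map<Character, Integer> types = new HashMap<>();
            for (int type = 1; type <= vocab.getMaxTokenType(); type++) {
                String lit = vocab.getLiteralName(type);   // e.g. "'a'" for T4 : 'a';
                if (lit == null) continue;                 // no literal for this token type
                Character c = unescape(lit.substring(1, lit.length() - 1));
                if (c != null) types.put(c, type);
            }
            return types;
        }

        // Undo the simple grammar escapes used above; anything else (e.g. a
        // multi-character literal) is skipped by returning null.
        private static Character unescape(String body) {
            if (body.equals("\\n")) return '\n';
            if (body.equals("\\r")) return '\r';
            if (body.equals("\\t")) return '\t';
            if (body.length() == 1) return body.charAt(0);
            return null;
        }
    }

Your token source would then look up each input character in CharTokenTypes.fromVocabulary(GrammarLexer.VOCABULARY) to get its token type, instead of using the raw character code.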