ANTLR4 parsing subrules

Question

I have a grammar that works fine when parsing in one pass (entire file).

Now I wish to break the parsing up into components and run the parser on subrules. I ran into an issue that I assume others parsing subrules will also see, with the following rule:

thing   :   LABEL? THING  THINGDATA thingClause?
            //{System.out.println("G4 Lexer/parser thing encountered");}
        ;
...
thingClause : ',' ID ( ',' ID)?
            ;

When the above rule is parsed from a top-level start rule that parses to EOF, everything works fine. When it is parsed as a sub-rule (not parsed to EOF), the parser gets upset when there is no thingClause, because it expects to see EITHER a "," token or EOF.

line 8:0 mismatched input '%' expecting {<EOF>, ','}

When I parse to EOF, the % gets correctly parsed into another "thing" component, because the top level rule looks for:

  toprule :  thing+
          |  endOfThingsTokens
          ;

And endOfThingsTokens occurs before EOF... so I expect this is why the top level rule works.

For parsing the subrule, I want the ANTLR4 parser to accept or ignore the % token and say "OK we aren't seeing a thingClause", then reset the token stream so the next thing object can be parsed by a different instance of the parser.

In this specific case I could change the lexer to pass newlines to the parser, which I currently skip in the lexer grammar. That would require lots of other changes to accept newlines in the token stream which are currently not needed.

Essentially I need some way to give the rule an "end of record" token, but I was wondering whether there is some way to solve this with a semantic predicate instead.

Something like:

    thing   :   { if comma before %}? LABEL? THING  THINGDATA thingClause?
            | LABEL? THING THINGDATA
            ;
    ...

    thingClause : ',' ID ( ',' ID)?
            ;

The above predicate pseudo code would hide the optional thingClause? if it won't be satisfied so that the parser would stop after parsing one "thing" without looking for a specific "end of thing" token (i.e. newline).

If I solve this I will post the answer.

GRosenberg · Accepted Answer · 2017-03-01T22:19:48

The parser will (effectively) look-ahead in the token stream to determine if the current rule can be satisfied. The corresponding tokens are then consumed. If any look-ahead tokens remain unconsumed, the parser looks for another rule against which to consume these and additional look-ahead tokens.

The thingClause? element, when not matched, will result in unconsumed tokens in the parser, hence the error you are seeing.

The parser look-ahead is data dependent, meaning that evaluating the elements of a rule can easily read more tokens into the parser than the current rule could possibly consume.

While a predicate could help, it will not make the problem deterministic. That is, even if the parser matches the non-predicated alt, it may well have read more tokens into the parser than can be consumed by that alt.

The only way to avoid this non-determinism would be to pre-inject <EOF> tokens into the token stream at the sub-rule boundaries.
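
As a rough illustration of that idea, here is a minimal Java sketch that splits an already lexed token list into one chunk per "thing" and ends each chunk with its own EOF token. It deliberately does not use the ANTLR runtime: the Tok class, the string type names, and the boundary rule (a record starts at THING, or at a LABEL immediately followed by THING) are all assumptions made for this example, not something taken from the grammar in the question.

```java
import java.util.ArrayList;
import java.util.List;

final class SubruleChunker {
    // Hypothetical type name; a real ANTLR lexer uses integer token types.
    static final String EOF = "EOF";

    /** Minimal stand-in for a lexer token: a type name plus the matched text. */
    static final class Tok {
        final String type;
        final String text;
        Tok(String type, String text) { this.type = type; this.text = text; }
    }

    /**
     * Splits a lexed token list into one chunk per record and appends an
     * EOF token to every chunk.
     * Assumption: a record starts at THING, or at a LABEL directly followed by THING.
     */
    static List<List<Tok>> split(List<Tok> tokens) {
        List<List<Tok>> chunks = new ArrayList<>();
        List<Tok> current = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.type.equals(EOF)) {
                break; // the original EOF is replaced by the per-chunk EOF below
            }
            boolean labelThenThing = t.type.equals("LABEL")
                    && i + 1 < tokens.size()
                    && tokens.get(i + 1).type.equals("THING");
            boolean bareThing = t.type.equals("THING")
                    && (i == 0 || !tokens.get(i - 1).type.equals("LABEL"));
            // A new record begins here: close the previous chunk with EOF.
            if ((labelThenThing || bareThing) && !current.isEmpty()) {
                current.add(new Tok(EOF, "<EOF>"));
                chunks.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) {
            current.add(new Tok(EOF, "<EOF>"));
            chunks.add(current);
        }
        return chunks;
    }
}
```

With the real runtime, each chunk could presumably be wrapped in a ListTokenSource and a CommonTokenStream and handed to a fresh parser instance, so that the look-ahead for thingClause? stops at the injected EOF instead of reading into the next record. How record boundaries are detected depends entirely on the actual lexer, so the boundary test above would need to be adapted.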
