Match sub-rule backwards in ANTLR4 parser

Question

I have a portion of an ANTLR4 rule that I'd like to parse backwards. I suspect that's not the real solution, so there's likely something I'm missing.

The crux of my problem is that there's a part in the middle of my expression that I'd like to extract. However, this part has some (defined) suffixes that I would really like to extract separately, if possible. These suffixes can be separated by a comma or not; the grammar works fine with the comma, but if the comma is missing, it takes the entire part as unknown, even if the suffixes are present.

I've pared down my grammar into a small example, visible at the bottom of this post.

Given the string "why hello, x y z foo bar baz blah blah blah, goodbye!", my grammar will parse "x y z foo bar baz" as a phrase. I would like to match "x y z" as unknown and "foo bar baz" as suffixes. If there is a comma ("x y z, foo bar baz"), it works: [image: tree generated with comma]

However, if there is no comma, it takes the entire "x y z foo bar baz" (as well as some of the text after) as unknown: [image: tree generated with no comma]

I tried changing unknown to be nongreedy (+?), but that is undesirable as well, consuming only one token for phrase: [image: tree generated with no comma and nongreedy unknown]

Is there a way to force the phrase rule to try matching suffixes from the right first before falling back to unknown?

Another way to put it: is there a way to have unknown match anything except when it ends with one or more suffixes? (The suffixes can appear in the text as long as they're not at the end)

Example grammar:

grammar Example;

// parse tree root
exampleExpression : ignored HELLO separator phrase separator? unknown separator? GOODBYE ignored;

// what I want to match
phrase : unknown (COMMA? suffix+)*;

// convenience rule for swaths of tokens to be ignored (e.g. at the beginning and end)
ignored : (unknown | separator)*;

// roll up unknown tokens under one rule
unknown : (~(PERIOD | COMMA | PIPE | BULLET | SP_SEP_DASH))+;
separator : PERIOD | COMMA | PIPE | BULLET | SP_SEP_DASH;

// the pre-defined suffixes
suffix : FOO | BAR | BAZ;

/* TOKENS */

HELLO : 'hello';
GOODBYE : 'goodbye';
FOO : 'foo';
BAR : 'bar';
BAZ : 'baz';

/* FRAGMENTS */

fragment DIGIT : [0-9];
fragment DASH : '-';

/* REMAINING TOKENS */

LPAREN : '(' ;
RPAREN : ')' ;
COMMA : ',';
PERIOD : '.';
PIPE : '|';
BULLET : '\u00B7' | '\u2219' | '\u22c5';
SP_SEP_DASH : SP DASH SP;

SP : [ \u000B\t\r\n] -> channel(HIDDEN);

NUMBER : ([0] | [1-9] DIGIT*) ('.' DIGIT+)?;
WORD : [A-Za-z] [A-Za-z-]*;

// catch-all
OTHER : .;

Wild guess: phrase : unknown COMMA? (suffix+)*; instead of phrase : unknown (COMMA? suffix+)*;. — 500 - Internal Server Error
@500-InternalServerError: Surely (suffix+)* is the same as suffix*, but I suppose what is really wanted is suffix+, possibly with interspersed COMMAs. IOW: COMMA? (suffix+ COMMA)* suffix+. But I don't think that is the fundamental problem here. — rici
My point was that phrase only appears to accept suffix if COMMA is also present. — 500 - Internal Server Error
Right, phrase has (COMMA? suffix+)* to support phrases like x y z, foo, bar, baz. Suffixes are not always present, though, hence * instead of +. FWIW, if I were to ignore the multiple-comma case, unknown COMMA? suffix* exhibits the same problems. — NickAldwin

rici · Accepted Answer · 2014-09-11T23:17:08

The question says:

Another way to put it: is there a way to have unknown match anything except when it ends with one or more suffixes? (The suffixes can appear in the text as long as they're not at the end)

But previously, a parse of unknown with internal suffixes was rejected:

However, if there is no comma, it takes the entire x y z foo bar baz (as well as some of the text after) as unknown

That seems inconsistent.

From the example, it seems like you are trying to do natural language parsing; ANTLR, whatever its virtues, is probably not a good tool for that. But that might just be a chimera based on your simplification.

In any event, the answer to your original question ("is it possible to define a non-terminal as any sequence of tokens which doesn't end with one or more tokens from a suffix class?") is "yes, that can be written as a context-free grammar". Without getting into ANTLR specifics, here's a simple CFG:

wordlist: /* empty */ | wordlist non_suffix | wordlist suffix_list non_suffix ;
suffix_list: suffix | suffix_list suffix ;
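To make the language of that CFG concrete, here is a minimal Python sketch, not ANTLR and not part of the original answer. Every nonempty derivation of wordlist ends in a non_suffix, so wordlist accepts exactly the token sequences that are empty or do not end in a suffix. The names SUFFIXES, is_wordlist and split_phrase are hypothetical, and tokens are assumed to be plain lowercase strings with separators already removed.

```python
# Assumption: the suffix class is the FOO | BAR | BAZ set from the question's grammar.
SUFFIXES = {"foo", "bar", "baz"}


def is_wordlist(tokens):
    """Return True if tokens can be derived from `wordlist`.

    wordlist is either empty or ends with a non_suffix token, so it is
    enough to inspect the last token.
    """
    return not tokens or tokens[-1] not in SUFFIXES


def split_phrase(tokens):
    """Split tokens into (wordlist, trailing suffix_list).

    Scans from the right and peels off the maximal run of suffix tokens;
    whatever remains is empty or ends in a non-suffix, so it is a wordlist.
    """
    end = len(tokens)
    while end > 0 and tokens[end - 1] in SUFFIXES:
        end -= 1
    return tokens[:end], tokens[end:]


if __name__ == "__main__":
    print(split_phrase("x y z foo bar baz".split()))
    print(split_phrase("x foo y bar".split()))
```

Suffix tokens in the middle of the sequence stay in the wordlist part, and only the trailing run is split off, which is the behaviour the question asks for. Note that this only looks at one phrase's tokens in isolation; inside the full exampleExpression rule the parser still has to decide where the phrase ends.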
