I have a portion of an ANTLR4 rule that I'd like to parse backwards. I suspect that's not the real solution, so there's likely something I'm missing.
The crux of my problem is that there's a part in the middle of my expression that I'd like to extract. However, this part has some (defined) suffixes that I would really like to extract separately, if possible. These suffixes can be separated by a comma or not; the grammar works fine with the comma, but if the comma is missing, it takes the entire part as `unknown`, even if the suffixes are present.
I've pared down my grammar into a small example, visible at the bottom of this post.
Given the string `why hello, x y z foo bar baz blah blah blah, goodbye!`, my grammar will parse `x y z foo bar baz` as a `phrase`. I would like to match `x y z` as `unknown` and `foo bar baz` as suffixes. If there is a comma (`x y z, foo bar baz`), it works.
However, if there is no comma, it takes the entire `x y z foo bar baz` (as well as some of the text after) as `unknown`.
I tried changing `unknown` to be nongreedy (`+?`), but that is undesirable as well: it consumes only one token for `phrase`.
Is there a way to force the `phrase` rule to try matching suffixes from the right first before falling back to `unknown`?
Another way to put it: is there a way to have `unknown` match anything except when it ends with one or more suffixes? (The suffixes can appear in the text as long as they're not at the end.)
Example grammar:
grammar Example;
// parse tree root
exampleExpression : ignored HELLO separator phrase separator? unknown separator? GOODBYE ignored;
// what I want to match
phrase : unknown (COMMA? suffix+)*;
// convenience rule for swaths of tokens to be ignored (e.g. at the beginning and end)
ignored : (unknown | separator)*;
// roll up unknown tokens under one rule
unknown : (~(PERIOD | COMMA | PIPE | BULLET | SP_SEP_DASH))+;
separator : PERIOD | COMMA | PIPE | BULLET | SP_SEP_DASH;
// the pre-defined suffixes
suffix : FOO | BAR | BAZ;
/* TOKENS */
HELLO : 'hello';
GOODBYE : 'goodbye';
FOO : 'foo';
BAR : 'bar';
BAZ : 'baz';
/* FRAGMENTS */
fragment DIGIT : [0-9];
fragment DASH : '-';
/* REMAINING TOKENS */
LPAREN : '(' ;
RPAREN : ')' ;
COMMA : ',';
PERIOD : '.';
PIPE : '|';
BULLET : '\u00B7' | '\u2219' | '\u22c5';
SP_SEP_DASH : SP DASH SP;
SP : [ \u000B\t\r\n] -> channel(HIDDEN);
NUMBER : ([0] | [1-9] DIGIT*) ('.' DIGIT+)?;
WORD : [A-Za-z] [A-Za-z-]*;
// catch-all
OTHER : .;
`phrase : unknown COMMA? (suffix+)*;` instead of `phrase : unknown (COMMA? suffix+)*;`. – 500 - Internal Server Error
`(suffix+)*` is the same as `suffix*`, but I suppose what is really wanted is `suffix+`, possibly with interspersed COMMAs. IOW: `COMMA? (suffix+ COMMA)* suffix+`. But I don't think that is the fundamental problem here. – rici
`phrase` only appears to accept `suffix` if `COMMA` is also present. – 500 - Internal Server Error
`phrase` has `(COMMA? suffix+)*` to support phrases like `x y z, foo, bar, baz`. Suffixes are not always present, though, hence `*` instead of `+`. FWIW, if I were to ignore the multiple-comma case, `unknown COMMA? suffix*` exhibits the same problems. – NickAldwin