I'm trying to use ANTLR4 to parse input strings that are described by a grammar like:
grammar MyGrammar;

parse : PREFIX? SEARCH ;

PREFIX
    : [0-9]+ ':'
    ;

SEARCH
    : .+
    ;
For example, valid input strings include:
0: maracujá
apple
3:€53.60
1: 10kg
2:chilli pepper
But the SEARCH rule always matches the whole string, whether it has a prefix or not.
I understand this is because the ANTLR4 lexer gives preference to the rule that matches the longest string. Therefore the SEARCH rule matches all input, not giving the PREFIX rule a chance.
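To make the longest-match behaviour concrete, here is a minimal Python sketch that imitates it with regular expressions. It is not ANTLR itself, only an approximation under two assumptions: the rule that matches the most characters wins, and on a tie the rule defined first wins. The names RULES and tokenize are made up for this illustration.

```python
import re

# Lexer rules in definition order, mirroring the grammar in the question.
# ANTLR's '.' matches any character, so DOTALL is used for SEARCH.
RULES = [
    ("PREFIX", re.compile(r"[0-9]+:")),
    ("SEARCH", re.compile(r".+", re.DOTALL)),
]


def tokenize(text):
    """Hypothetical helper: split text into (rule_name, lexeme) pairs
    using a longest-match strategy."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (rule_name, end_offset) of the best candidate so far
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if match and match.end() > pos and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


print(tokenize("3:€53.60"))
print(tokenize("0:"))
```

With the input 3:€53.60, PREFIX can match only the first two characters while SEARCH can match all eight, so the sketch emits a single SEARCH token. PREFIX only wins when nothing follows the colon, because then both rules match the same length and the earlier rule is kept.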
And the non-greedy version (i.e. SEARCH : .+? ;) has the same problem because (as I understand) it's only non-greedy within the rule, and the SEARCH rule doesn't have any other parts to constrain it.
If it helps, I could constrain the SEARCH text to exclude ':', but I really would prefer it to recognise anything else: Unicode characters, symbols, numbers, spaces etc.
I've read "Lexer to handle lines with line number prefix", but in that case the body of the string (after the prefix) is significantly more constrained.
Note: SEARCH text might have a structure to it, like €53.60 and 10kg above (which I'd also like ANTLR4 to parse), or it might just be free text, like apple, maracujá and chilli pepper above. But I've tried to simplify so I can solve the problem of extracting the PREFIX first.
SEARCH rule, because your grammar would be ambiguous. 0: x 1: y could be tokenized as either PREFIX SEARCH PREFIX SEARCH or PREFIX SEARCH. – Mephy
0: x 1: y would be a PREFIX of 0: and SEARCH of x 1: y - so there's only ever one PREFIX and everything that follows is the SEARCH. – Joachim Chapman
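The intended split described in the last comment (at most one leading PREFIX, everything after it is the SEARCH) can be written down as a small Python sketch. This is plain regex matching, not an ANTLR solution, and the function name split_query is hypothetical. It assumes that the prefix keeps its trailing colon, as in the PREFIX rule, and that whitespace after the colon stays part of the search text.

```python
import re

# Optional run of digits followed by ':', then at least one more character.
# DOTALL lets the search text span any character, like ANTLR's '.'.
QUERY = re.compile(r"([0-9]+:)?(.+)", re.DOTALL)


def split_query(text):
    """Hypothetical helper: return (prefix, search), where prefix is None
    when the input does not start with digits followed by a colon."""
    match = QUERY.fullmatch(text)
    if match is None:
        raise ValueError("input must contain some search text")
    return match.group(1), match.group(2)


for sample in ["0: x 1: y", "apple", "3:€53.60"]:
    print(repr(sample), "->", split_query(sample))
```

Because only the very start of the input is tested for a prefix, a later 1: in 0: x 1: y stays inside the search text, which is the single-prefix reading given in the comment.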