Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring ss of a longer string S that matches a given rule my_rule?
Conceptually, I could accomplish this by looking for a match at position S[i], incrementing i until I successfully retrieve a match or S is exhausted.
However, in practice this doesn't work very well, because prefixes in S might coincidentally have characters that match tokens in my grammar. Depending on how this happens, a valid string ss in S might get recognized several times, or skipped over erratically, or there might be lots of errors printed about "token recognition error".
Is there an approach I haven't thought of, or an ANTLR feature I don't know about?
I'm using the Python bindings for ANTLR, if that matters.
EXAMPLE:
Given the following grammar:
grammar test ;
options { language=Python3; }
month returns [val]
: JAN {$val = 1}
| FEB {$val = 2}
| MAR {$val = 3}
| APR {$val = 4}
| MAY {$val = 5}
;
day_number returns [val]
: a=INT {$val = int($a.text)} ;
day returns [val]
: day_number WS? {$val = int($day_number.start.text)}
;
month_and_day returns [val]
: month WS day {$val = ($month.val, $day.val)}
| day WS ('of' WS)? month {$val = ($month.val, $day.val)}
;
WS : [ \n\t]+ ; // whitespace is not ignored
JAN : 'jan' ('.' | 'uary')? ;
FEB : 'feb' ('.' | 'ruary')? ;
MAR : 'mar' ('.' | 'ch')? ;
APR : 'apr' ('.' | 'il')? ;
MAY : 'may' ;
INT
: [1-9]
| '0' [1-9]
| '1' [0-9]
| '2' [0-3]
;
and the following script to test it:
import sys
sys.path.append('gen')
from testParser import testParser
from testLexer import testLexer
from antlr4 import InputStream
from antlr4 import CommonTokenStream, TokenStream
def parse(text: str):
date_input = InputStream(text.lower())
lexer = testLexer(date_input)
stream = CommonTokenStream(lexer)
parser = testParser(stream)
return parser.month_and_day()
for t in ['Jan 6',
'hello Jan 6, 1984',
'hello maybe Jan 6, 1984']:
value = parse(t)
print(value.val)
I get the following results:
# First input - good
(1, 6)
# Second input - errors printed to STDERR
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:11 token recognition error at: ','
(1, 6)
# Third input - prints errors and throws exception
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:9 token recognition error at: 'b'
line 1:10 token recognition error at: 'e'
line 1:12 mismatched input 'jan' expecting INT
Traceback (most recent call last):
File "test_grammar.py", line 25, in <module>
value = parse(t)
File "test_grammar.py", line 19, in parse
return parser.month_and_day()
File "gen/testParser.py", line 305, in month_and_day
localctx._day = self.day()
File "gen/testParser.py", line 243, in day
localctx.val = int((None if localctx._day_number is None else localctx._day_number.start).text)
ValueError: invalid literal for int() with base 10: 'jan'
Process finished with exit code 1
To use the incremental approach I outlined above, I'd need a way to suppress the token recognition error output and also wrap the exception in a try or similar. Feels like I'd be very much going against the grain, and it would be difficult to distinguish these parsing exceptions from other things going wrong.
(META - I could swear I already asked this question somewhere about 4 months ago, but I couldn't find anything on SO, or the ANTLR GitHub tracker, or the ANTLR Google Group.)
Sstructured, or is it some bit of arbitrary text? What is the full expression of the rulemy_rule? Hard to understand para 3 without further details. - GRosenbergScould be completely arbitrary - in my case it's text that a user can type. I'll add an example rule when I get back to work tomorrow, thanks. - Ken Williams