ANTLR - find first match for grammar within string

Question

Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring ss of a longer string S that matches a given rule my_rule?

Conceptually, I could accomplish this by looking for a match at position S[i], incrementing i until I successfully retrieve a match or S is exhausted.

However, in practice this doesn't work very well, because prefixes in S might coincidentally have characters that match tokens in my grammar. Depending on how this happens, a valid string ss in S might get recognized several times, or skipped over erratically, or there might be lots of errors printed about "token recognition error".

Is there an approach I haven't thought of, or an ANTLR feature I don't know about?

I'm using the Python bindings for ANTLR, if that matters.

EXAMPLE:

Given the following grammar:

grammar test ;

options { language=Python3; }


month returns [val]
  : JAN {$val = 1}
  | FEB {$val = 2}
  | MAR {$val = 3}
  | APR {$val = 4}
  | MAY {$val = 5}
  ;

day_number returns [val]
  : a=INT {$val = int($a.text)} ;

day returns [val]
  : day_number WS?     {$val = int($day_number.start.text)}
  ;

month_and_day returns [val]
  : month WS day             {$val = ($month.val, $day.val)}
  | day WS ('of' WS)? month  {$val = ($month.val, $day.val)}
  ;


WS : [ \n\t]+ ;  // whitespace is not ignored

JAN : 'jan' ('.' | 'uary')? ;
FEB : 'feb' ('.' | 'ruary')? ;
MAR : 'mar' ('.' | 'ch')? ;
APR : 'apr' ('.' | 'il')? ;
MAY : 'may' ;


INT
  : [1-9]
  | '0' [1-9]
  | '1' [0-9]
  | '2' [0-3]
  ;

and the following script to test it:

import sys

sys.path.append('gen')


from testParser import testParser
from testLexer import testLexer


from antlr4 import InputStream
from antlr4 import CommonTokenStream, TokenStream


def parse(text: str):
    date_input = InputStream(text.lower())
    lexer = testLexer(date_input)
    stream = CommonTokenStream(lexer)
    parser = testParser(stream)
    return parser.month_and_day()


for t in ['Jan 6',
          'hello Jan 6, 1984',
          'hello maybe Jan 6, 1984']:
    value = parse(t)
    print(value.val)

I get the following results:

# First input - good
(1, 6)

# Second input - errors printed to STDERR
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:11 token recognition error at: ','
(1, 6)

# Third input - prints errors and throws exception
line 1:0 token recognition error at: 'h'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'l'
line 1:3 token recognition error at: 'l'
line 1:4 token recognition error at: 'o '
line 1:9 token recognition error at: 'b'
line 1:10 token recognition error at: 'e'
line 1:12 mismatched input 'jan' expecting INT
Traceback (most recent call last):
  File "test_grammar.py", line 25, in <module>
    value = parse(t)
  File "test_grammar.py", line 19, in parse
    return parser.month_and_day()
  File "gen/testParser.py", line 305, in month_and_day
    localctx._day = self.day()
  File "gen/testParser.py", line 243, in day
    localctx.val = int((None if localctx._day_number is None else localctx._day_number.start).text)
ValueError: invalid literal for int() with base 10: 'jan'
Process finished with exit code 1

To use the incremental approach I outlined above, I'd need a way to suppress the token recognition error output and also wrap the exception in a try or similar. Feels like I'd be very much going against the grain, and it would be difficult to distinguish these parsing exceptions from other things going wrong.

(META - I could swear I already asked this question somewhere about 4 months ago, but I couldn't find anything on SO, or the ANTLR GitHub tracker, or the ANTLR Google Group.)

Is the string S structured, or is it some bit of arbitrary text? What is the full expression of the rule my_rule? Hard to understand para 3 without further details. — GRosenberg
S could be completely arbitrary - in my case it's text that a user can type. I'll add an example rule when I get back to work tomorrow, thanks. — Ken Williams

GRosenberg GRosenberg · Accepted Answer · 2018-07-19T18:49:36

Is there a way to use an ANTLR parser as a searcher, i.e. to find the first instance of a substring ss of a longer string S that matches a given rule my_rule?

The short answer is no. ANTLR does not work as a substitute/equivalent to any of the standard regex-based tools, like sed and awk.

The longer answer is yes, but with messy caveats. ANTLR expects to parse a structured, largely unambiguous input text. Text that is of no semantic significance can be ignored by adding the lexer rule (at lowest priority/bottom position)

IGNORE : . -> skip;

That way, anything not explicitly recognized in the lexer is ignored.

The next problem is the potential semantic overlap between 'normal' text and keywords, e.g. Jan (name) - Jan (month abrev). In general, this can be handled by adding a BaseErrorListener to the parser to distinguish between real and meaningless errors. What constitutes real vs meaningless can involve complex corner cases depending on the application.

Finally, the rule

day_number returns [val]
  : a=INT {$val = int($a.text)} ;

is returning an int value not an INT token, hence the error that is being reported. The rule should be

day_number : INT ;

ANTLR - find first match for grammar within string

2 Answers