Match up to a specific number of repetitions in a non-greedy way in ANTLR

Question

In my grammar I have something like this:

line : startWord (matchPhrase|
                  anyWord matchPhrase|
                  anyWord anyWord matchPhrase|
                  anyWord anyWord anyWord matchPhrase|
                  anyWord anyWord anyWord anyWord matchPhrase) 
       -> ^(TreeParent startWord anyWord* matchPhrase);

So I want to match the first occurrence of matchPhrase, but I will allow up to a certain number of anyWord before it. The tokens that make up matchPhrase are also matched by anyWord.

Is there a better way of doing this?

I think it might be possible by combining the semantic predicate in this answer with the non-greedy option:

(options {greedy=false;} : anyWord)*

but I can't figure out exactly how to do this.

Edit: Here's an example. I want to extract information from the following sentences:

Picture of a red flower.

Picture of the following: A red flower.

My input is actually tagged English sentences, and the Lexer rules match the tags rather than the words. So the input to ANTLR is:

NN-PICTURE Picture IN-OF of DT a JJ-COLOR red NN-FLOWER flower

NN-PICTURE Picture IN-OF of DT the VBG following COLON : DT a JJ-COLOR red NN-FLOWER flower
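To make the input format concrete, here is a minimal Python sketch (not part of the grammar) that splits such a line into tag/word pairs. The helper name `tag_word_pairs` is hypothetical, and the sketch assumes that tags and words strictly alternate and are separated by whitespace.

```python
def tag_word_pairs(line):
    """Split a tagged sentence into (tag, word) pairs.

    Assumption: tags and words strictly alternate, separated by whitespace.
    """
    parts = line.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating tag/word tokens")
    # Even positions hold the tags, odd positions hold the words.
    return list(zip(parts[0::2], parts[1::2]))


pairs = tag_word_pairs(
    "NN-PICTURE Picture IN-OF of DT a JJ-COLOR red NN-FLOWER flower"
)
print(pairs)
```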

I have lexer rules and one parser rule for each tag, like this:

WS :  (' ')+ {skip();};
TOKEN : (~' ')+;

nnpicture:'NN-PICTURE' TOKEN -> ^('NN-PICTURE' TOKEN);
vbg:'VBG' TOKEN -> ^('VBG' TOKEN);

And my parser rules are something like this:

sentence : nnpicture inof matchFlower;

matchFlower : (dtTHE|dt)? jjcolor? nnflower;

But of course this will fail on the second sentence. So I want to allow a bit of flexibility by allowing up to N tokens before the flower match. I have an anyWord token that matches anything, and the following works:

sentence :  nnpicture inof ( matchFlower | 
                             anyWord matchFlower |
                             anyWord anyWord matchFlower | etc.

but it isn't very elegant, and doesn't work well with large N.
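For reference, the behaviour I am after can be sketched outside ANTLR in a few lines of Python. This is only an illustration of the intended semantics, not a grammar: the function names are hypothetical, each (tag, word) pair stands for one anyWord, and match_flower is reduced to checking tags only.

```python
def match_flower(pairs, pos):
    """Try `dt? jjcolor? nnflower` at pos; return the end index or None."""
    i = pos
    if i < len(pairs) and pairs[i][0] == "DT":
        i += 1  # optional determiner
    if i < len(pairs) and pairs[i][0] == "JJ-COLOR":
        i += 1  # optional colour adjective
    if i < len(pairs) and pairs[i][0] == "NN-FLOWER":
        return i + 1  # mandatory flower noun
    return None


def match_after_at_most(pairs, start, max_skip):
    """Skip as few words as possible (at most max_skip) before the phrase."""
    for skipped in range(max_skip + 1):
        end = match_flower(pairs, start + skipped)
        if end is not None:
            return skipped, end
    return None  # no match within the allowed number of skipped words


sentence = [("DT", "the"), ("VBG", "following"), ("COLON", ":"),
            ("DT", "a"), ("JJ-COLOR", "red"), ("NN-FLOWER", "flower")]
print(match_after_at_most(sentence, 0, 4))
print(match_after_at_most(sentence, 0, 2))
```

Trying the smallest skip count first is what makes the match non-greedy, and `max_skip` plays the role of N.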

@BartKiers: Sorry I didn't explain it that well - matchPhrase is a subset of anyWord, so there could be a number of words that aren't in matchPhrase before matchPhrase, and they would be matched by anyWord. But because it is a subset, the anyWord match needs to be non-greedy otherwise the matchPhrase words will be matched by anyWord. Hence why I can't do anyWord? anyWord? anyWord? matchPhrase. — Matt Swain
@Matt, I see what you mean. If someone doesn't do so before me, I'll answer you this evening (I'm at work ATM). — Bart Kiers

Bart Kiers · Accepted Answer · 2012-03-14T19:40:18

You can do that by first checking, inside the matchFlower rule, whether dt? jjcolor? nnflower really is ahead in the token stream, using a syntactic predicate. If those tokens are ahead, simply match them; if not, match any single token and then recursively match matchFlower. This would look like:

matchFlower
 : (dt? jjcolor? nnflower)=> dt? jjcolor? nnflower -> ^(FLOWER dt? jjcolor? nnflower)
 |                           . matchFlower         -> matchFlower
 ;

Note that the . (dot) inside a parser rule does not match any character, but any token.
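The control flow of that rule can be mimicked in plain Python, which may help to see why the first alternative wins as early as possible. This is a rough sketch under simplifying assumptions: tokens are plain strings, any string is accepted where the grammar expects a word after JJ-COLOR or NN-FLOWER, and the names `lookahead_flower` and `match_flower` are made up for this illustration.

```python
def lookahead_flower(tokens, pos):
    """Plays the role of the predicate `(dt? jjcolor? nnflower)=>`.

    Returns the index just past the phrase, or None if it is not ahead.
    """
    i = pos
    if tokens[i:i + 2] in (["DT", "the"], ["DT", "a"]):
        i += 2  # optional dt
    if tokens[i:i + 1] == ["JJ-COLOR"] and i + 1 < len(tokens):
        i += 2  # optional jjcolor
    if tokens[i:i + 1] == ["NN-FLOWER"] and i + 1 < len(tokens):
        return i + 2  # mandatory nnflower
    return None


def match_flower(tokens, pos=0):
    """Return (start, end) of the first flower phrase at or after pos."""
    end = lookahead_flower(tokens, pos)
    if end is not None:
        return pos, end  # first alternative: the phrase is ahead
    if pos >= len(tokens):
        raise ValueError("no flower phrase found")
    return match_flower(tokens, pos + 1)  # second alternative: `. matchFlower`


tokens = "DT the VBG following COLON : DT a JJ-COLOR red NN-FLOWER flower".split()
start, end = match_flower(tokens)
print(tokens[start:end])
```

Like the grammar rule, this sketch puts no upper bound on the number of skipped tokens; a counter passed along the recursion would be one way to add such a limit.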

Here's a quick demo:

grammar T;

options {
  output=AST;
}

tokens {
  TEXT;
  SENTENCE;
  FLOWER;
}

parse
 : sentence+ EOF -> ^(TEXT sentence+)
 ;

sentence
 : nnpicture inof matchFlower -> ^(SENTENCE nnpicture inof matchFlower)
 ;

nnpicture
 : NN_PICTURE TOKEN -> ^(NN_PICTURE TOKEN)
 ;

matchFlower
 : (dt? jjcolor? nnflower)=> dt? jjcolor? nnflower -> ^(FLOWER dt? jjcolor? nnflower)
 |                           . matchFlower         -> matchFlower
 ;

inof
 : IN_OF (t=IN | t=OF) -> ^(IN_OF $t)
 ;

dt
 : DT (t=THE | t=A) -> ^(DT $t)
 ;

jjcolor
 : JJ_COLOR TOKEN -> ^(JJ_COLOR TOKEN)
 ;

nnflower
 : NN_FLOWER TOKEN -> ^(NN_FLOWER TOKEN)
 ;

IN_OF      : 'IN-OF';
NN_FLOWER  : 'NN-FLOWER';
DT         : 'DT';
A          : 'a';
THE        : 'the';
IN         : 'in';
OF         : 'of';
VBG        : 'VBG';
NN_PICTURE : 'NN-PICTURE';
JJ_COLOR   : 'JJ-COLOR';
TOKEN      : ~' '+;
WS         : ' '+ {skip();};

A parser generated from the grammar above would parse your input:

NN-PICTURE Picture IN-OF of DT the VBG following COLON : DT a JJ-COLOR red NN-FLOWER flower

as follows:

(image: the resulting AST)

As you can see, everything before the flower is omitted from the tree. If you want to keep these tokens in there, do something like this:

grammar T;

// ...

tokens {
  // ...
  NOISE;
}

// ...

matchFlower
 : (dt? jjcolor? nnflower)=> dt? jjcolor? nnflower -> ^(FLOWER dt? jjcolor? nnflower)
 |                           t=. matchFlower       -> ^(NOISE $t) matchFlower
 ;

// ...

resulting in the following AST:

(image: the resulting AST, including the NOISE nodes)
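The effect of the NOISE variant can be approximated in Python as a flat list of nodes. This is only a sketch under assumptions: tokens are plain strings, the FLOWER node keeps its tokens as a flat list rather than as dt, jjcolor and nnflower subtrees, and the function names are hypothetical.

```python
def lookahead_flower(tokens, pos):
    """Return the index just past `dt? jjcolor? nnflower` at pos, or None."""
    i = pos
    if tokens[i:i + 2] in (["DT", "the"], ["DT", "a"]):
        i += 2
    if tokens[i:i + 1] == ["JJ-COLOR"] and i + 1 < len(tokens):
        i += 2
    if tokens[i:i + 1] == ["NN-FLOWER"] and i + 1 < len(tokens):
        return i + 2
    return None


def match_flower_keep_noise(tokens, pos=0):
    """Collect one NOISE node per skipped token, then a single FLOWER node."""
    nodes = []
    while pos < len(tokens):
        end = lookahead_flower(tokens, pos)
        if end is not None:
            nodes.append(("FLOWER", tokens[pos:end]))
            return nodes
        nodes.append(("NOISE", tokens[pos]))  # mirrors `t=.  -> ^(NOISE $t)`
        pos += 1
    raise ValueError("no flower phrase found")


tokens = "DT the VBG following COLON : DT a JJ-COLOR red NN-FLOWER flower".split()
for node in match_flower_keep_noise(tokens):
    print(node)
```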
