I'm trying to write an ANTLR4 parser rule that matches the content between two arbitrary strings that are identical. So far I haven't found a way to do it.

For example, in the input below, I need a rule that extracts Hello and Bye. I'm not interested in extracting xyz, though, because its delimiters differ.

TEXT Hello TEXT

TEXT1 Bye TEXT1

TEXT5 xyz TEXT8

Since this is very similar to an XML element grammar, I tried the XML parser example from the ANTLR4 XML grammar, but it parses input like <ABC> ... </XYZ> without error, which is not what I want.

I also tried using semantic predicates without much success.

Could anyone please give me a hint on how to match content that is embedded between identical strings?

Thank you!

Satheesh

What determines the delimiter text? Is it a fixed set of delimiters (e.g. TEXT, TEXT1, but not TEXT5), or is it really arbitrary and must be set by, e.g., the application? It's simpler if there is a fixed set, as you can code this into the grammar directly; otherwise you will need a validating semantic predicate. - Mike Lischke
@Mike Thank you for your response. The delimiter is completely arbitrary and not a fixed set. I just need to grab the content that occurs between two identical string values. Could you please let me know how to use semantic predicates in this case? - Satheesh

1 Answer

I'm not sure how this works out performance-wise, because of the many checks the parser has to do, but you could try something like:

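// Java target assumed. Labels are referenced with $ and compared by text.
// The labels are named open/close so they don't clash with the rule context's own start/stop tokens.
token:
    open = IDENTIFIER WORD* close = IDENTIFIER { $open.text.equals($close.text) }?
;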

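The part between the curly braces (followed by the question mark) is a validating semantic predicate: it is evaluated once the second delimiter has been matched, and the rule fails if it is false. The lexer tokens are self-explanatory, I believe.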

The more I think about it, the more it seems better to just tokenize the input and write your own parser that processes the tokens and acts accordingly. That depends, of course, on the complexity of the syntax.
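
As a rough illustration of that hand-written route, here is a minimal Java sketch. It assumes things the question does not state: that each entry sits on its own line, that tokens are separated by whitespace, and that the delimiters are the first and last token of the line. The class name DelimitedContent and the method extract are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class DelimitedContent {

    // Returns the text between the first and last token of a line when both
    // tokens are identical; returns an empty Optional otherwise.
    // Assumption: tokens are separated by whitespace, one entry per line.
    public static Optional<String> extract(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            return Optional.empty(); // blank line or a single token: no delimiter pair
        }
        String open = tokens[0];
        String close = tokens[tokens.length - 1];
        if (!open.equals(close)) {
            return Optional.empty(); // delimiters differ, e.g. TEXT5 ... TEXT8
        }
        // Join everything between the two delimiters back into one string.
        return Optional.of(String.join(" ", Arrays.copyOfRange(tokens, 1, tokens.length - 1)));
    }

    public static void main(String[] args) {
        String input = "TEXT Hello TEXT\n\nTEXT1 Bye TEXT1\n\nTEXT5 xyz TEXT8\n";
        List<String> found = new ArrayList<>();
        for (String line : input.split("\\R")) { // \R matches any line break
            extract(line).ifPresent(found::add);
        }
        System.out.println(found); // prints [Hello, Bye]
    }
}
```

If the real syntax allows nesting or entries that span several lines, this simple first-token/last-token check would no longer be enough, and the predicate-based grammar rule may be the better fit.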