Here is a subset of the language I want to parse:

  • A program consists of statements
  • A statement is an assignment: A = "b"
  • Assignment's left side is an identifier (all caps)
  • Assignment's right side is a string enclosed by quotation marks
  • A string supports string interpolation by inserting a bracket-enclosed identifier (A = "b[C]d")

So far this is straightforward enough. Here is what works:

Lexer:

lexer grammar string_testLexer;

STRING_START: '"' -> pushMode(STRING);
WS: [ \t\r\n]+  -> skip ;
ID: [A-Z]+;
EQ: '=';

mode STRING;

VAR_START: '[' -> pushMode(INTERPOLATION);
DOUBLE_QUOTE_INSIDE: '"' -> popMode;
REGULAR_STRING_INSIDE: ~('"'|'[')+;


mode INTERPOLATION;
ID_INSIDE: [A-Z]+;
CLOSE_BRACKET_INSIDE: ']' -> popMode;

Parser:

parser grammar string_testParser;

options { tokenVocab=string_testLexer; }

mainz: stat *;
stat: ID EQ string;

string: STRING_START string_part* DOUBLE_QUOTE_INSIDE;
string_part: interpolated_var | REGULAR_STRING_INSIDE;
interpolated_var: VAR_START ID_INSIDE CLOSE_BRACKET_INSIDE;

So far so good. However there is one more language feature:

  • if there is no valid identifier (that is all caps) in the brackets, treat as normal string.

Eg:

A = "hello" => "hello"
B = "h[A]a" => "h", A, "a"
C="h [A] a" => "h ", A, " a"
D="h [A][V] a" => "h ", A, V, " a"
E = "h [A] [V] a" => "h ", A, " ", V, " a"
F = "h [aVd] a" => "h [aVd] a"
G = "h [Va][VC] a" => "h [Va]", VC, " a"
H = "h [V][][ff[Z]" => "h ", V, "[][ff", Z

I tried to replace REGULAR_STRING_INSIDE: ~('"'|'[')+; with just REGULAR_STRING_INSIDE: ~('"')+;, but that does not work in ANTLR. It results in matching all the lines above as strings.

Since ANTLR4 has no backtracking option to enable, I'm not sure how to overcome this and tell ANTLR that if it did not match the interpolated_var rule it should go ahead and match REGULAR_STRING_INSIDE instead. It seems to always choose the latter.

I read that the lexer always matches the longest token, so I tried to lift REGULAR_STRING_INSIDE and VAR_START into parser rules, hoping that the order of alternatives in the parser would be honoured:

r: REGULAR_STRING_INSIDE;
v: VAR_START;

string: STRING_START string_part* DOUBLE_QUOTE_INSIDE;
string_part: v ID_INSIDE CLOSE_BRACKET_INSIDE | r;

That did not seem to make any difference at all.

I also read that ANTLR4 semantic predicates could help, but I have trouble coming up with the ones that need to be applied in this case.

How do I modify the grammar above so that it matches well-formed interpolated bits and treats malformed ones as plain strings?

Test input:

A = "hello"
B = "h[A]a"
C="h [A] a"
D="h [A][V] a"
E = "h [A] [V] a"
F = "h [aVd] a"
G = "h [Va][VC] a"
H = "h [V][][ff[Z]"

How I compile / test:

antlr4 string_testLexer.g4
antlr4 string_testParser.g4
javac *.java
grun string_test mainz st.txt -tree

1 Answer

I tried to replace REGULAR_STRING_INSIDE: ~('"'|'[')+; with just REGULAR_STRING_INSIDE: ~('"')+;, but that does not work in ANTLR. It results in matching all the lines above as strings.

Correct, ANTLR tries to match as much as possible. So ~('"')+ will be far too greedy.
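
To make the greediness concrete, here is a small Python sketch of the "longest match wins, earlier rule wins ties" selection that the ANTLR lexer applies within a mode. It is only an emulation with regular expressions, not ANTLR itself; the helper name longest_match and the two rule tables are made up for this illustration.

```python
import re

def longest_match(rules, text, pos=0):
    """Return (rule_name, matched_text) for the longest match at `pos`.

    Earlier rules win ties, mimicking how the ANTLR lexer picks a token.
    Returns None when no rule matches at all.
    """
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# STRING mode with the over-greedy rule ~('"')+
greedy_rules = [
    ("VAR_START", r'\['),
    ("DOUBLE_QUOTE_INSIDE", r'"'),
    ("REGULAR_STRING_INSIDE", r'[^"]+'),
]

# STRING mode with the original rule ~('"'|'[')+
original_rules = [
    ("VAR_START", r'\['),
    ("DOUBLE_QUOTE_INSIDE", r'"'),
    ("REGULAR_STRING_INSIDE", r'[^"\[]+'),
]

body = 'h[A]a"'  # what the lexer sees after STRING_START in B = "h[A]a"

print(longest_match(greedy_rules, body))       # ('REGULAR_STRING_INSIDE', 'h[A]a')
print(longest_match(original_rules, body))     # ('REGULAR_STRING_INSIDE', 'h')
print(longest_match(greedy_rules, body, 1))    # ('REGULAR_STRING_INSIDE', '[A]a')
print(longest_match(original_rules, body, 1))  # ('VAR_START', '[')
```

With the greedy rule, VAR_START can never win: wherever a [ appears, REGULAR_STRING_INSIDE matches at least as far and usually further, so the parser never even sees an interpolation token.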

I also read that antlr4 semantic predicates could help.

Only use predicates as a last resort. They introduce target-specific code into your grammar. If they're not needed (which in this case they aren't), then don't use them.

Try something like this:

REGULAR_STRING_INSIDE
 : ( ~( '"' | '[' )+ 
   | '[' [A-Z]* ~( ']' | '"' | [A-Z] ) 
   | '[]'
   )+
 ;

The rule above would read as:

  1. match any char other than " or [ once or more
  2. OR match a [ followed by zero or more capitals, followed by any char other than ], " or a capital (your [Va and [aVd cases); the " is excluded so that the token cannot run past the closing quote
  3. OR match an empty block, []

And match one of these 3 alternatives above once or more to create a single REGULAR_STRING_INSIDE.
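
If you want to sanity-check the three alternatives against your examples without regenerating the parser, a rough Python emulation could look like the sketch below. It is an approximation under a few assumptions: the names TOKEN and split_string are invented here, VAR_START ID_INSIDE CLOSE_BRACKET_INSIDE are folded into a single pattern, the input is the string body without the surrounding quotes, and the second alternative also excludes " so that the pattern can never consume a closing quote.

```python
import re

# One pattern per "kind" of string part:
#   VAR  ~ VAR_START ID_INSIDE CLOSE_BRACKET_INSIDE  (a well-formed [ID])
#   TEXT ~ REGULAR_STRING_INSIDE with its three alternatives
TOKEN = re.compile(
    r'(?P<VAR>\[(?P<ID>[A-Z]+)\])'
    r'|(?P<TEXT>(?:'
    r'[^"\[]+'                # 1. anything except " and [
    r'|\[[A-Z]*[^\]"A-Z]'     # 2. [ + capitals + a char that is not ], " or a capital
    r'|\[\]'                  # 3. an empty block []
    r')+)'
)

def split_string(body):
    """Split a string body (no surrounding quotes) into TEXT and VAR parts."""
    parts = []
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"no rule matches at offset {pos}: {body[pos:]!r}")
        if m.group("ID") is not None:
            parts.append(("VAR", m.group("ID")))
        else:
            parts.append(("TEXT", m.group("TEXT")))
        pos = m.end()
    return parts

for body in ["hello", "h[A]a", "h [aVd] a", "h [Va][VC] a", "h [V][][ff[Z]"]:
    print(repr(body), "=>", split_string(body))
```

For the eight strings in the question this yields the splits listed there, for example "h [Va]", VC, " a" for G and "h ", V, "[][ff", Z for H. Inputs outside those examples (such as an unterminated [AB at the end of a string) are not covered and make the sketch raise an error.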

And if a string can end with one or more [, you may also want to do this:

DOUBLE_QUOTE_INSIDE
 : '['* '"' -> popMode
 ;