Parse string antlr

Question

I have strings as a parser rule rather than a lexer rule because strings may contain escapes with expressions in them, such as "The variable is \(variable)".

string
 : '"' character* '"'
 ;

character
 : escapeSequence
 | .
 ;

escapeSequence
 : '\(' expression ')'
 ;

IDENTIFIER
 : [a-zA-Z][a-zA-Z0-9]*
 ;

WHITESPACE
 : [ \r\t,] -> skip
 ;

This doesn't work because . matches any token rather than any character, so many identifiers will be matched and whitespace will be completely ignored.

How can I parse strings that can have expressions inside of them?

Looking into the parsers for Swift and JavaScript, both languages that support things like this, I can't figure out how they work. From what I can tell, they just output a string such as "my string with (variables) in it" without actually being able to parse the variable as its own thing.

@sepp2k I'm not sure how they would help, if there is a string inside the string and a parenthesis in it that would require writing the entire parser into the tokenizer unless I'm missing something — pfg
Related: stackoverflow.com/questions/37916661/antlr-string-interpolation Also: thosakwe.com/parsing-string-interpolations-with-antlr4 — sepp2k
I would not use the lexer or parser to do string interpolation at all. Parse the string as one entity and then just do a search and replace on the interpolation values in a post-processing step. IMO that's way better than inventing complicated interpolation grammar rules. Side note: parsing strings in the parser won't work, as you would then allow spaces between the quotes and the content, and those spaces don't appear in the output. — Mike Lischke
@MikeLischke Assuming arbitrary expressions are allowed between the \(), it isn't as simple as that. You'd either have to invoke the parser again after the \( or repeat a lot of the parser's work in post-processing. — sepp2k

sepp2k · Accepted Answer · 2018-11-28T17:22:28

This problem can be approached using lexical modes by having one mode for the inside of strings and one (or more) for the outside. Seeing a " on the outside would switch to the inside mode and seeing a \( or " on the inside would switch back outside. The only complicated part would be seeing a ) on the outside: sometimes it should switch back to the inside (because it corresponds to a \() and sometimes it shouldn't (when it corresponds to a plain ().

The most basic way to achieve this would be like this:

Lexer:

lexer grammar StringLexer;

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(' -> pushMode(DEFAULT_MODE);
RPAR: ')' -> popMode;

mode IN_STRING;

TEXT: ~[\\"]+ ;

BACKSLASH_PAREN: '\\(' -> pushMode(DEFAULT_MODE);

ESCAPE_SEQUENCE: '\\' . ;

DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;

Parser:

parser grammar StringParser;

options {
    tokenVocab = 'StringLexer';
}

start: exp EOF ;

exp : '(' exp ')'
    | IDENTIFIER
    | DQUOTE stringContents* DQUOTE
    ;

stringContents : TEXT
               | ESCAPE_SEQUENCE
               | '\\(' exp ')'
               ;

Here we push the default mode every time we see a ( or \( and pop the mode every time we see a ). This way it will go back inside the string only if the mode on top of the stack is the string mode, which would only be the case if there aren't any unclosed ( left since the last \(.
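The push/pop behaviour described above can be traced with a small Python model of the mode stack. This is only a sketch, not ANTLR-generated code: the RULES table, the tokenize function and the regular expressions are hypothetical stand-ins for the lexer rules, and the model assumes ANTLR's usual "longest match wins, earlier rule wins on a tie" behaviour.

```python
import re

# Hypothetical stand-in for the lexer grammar: (token type, regex, command).
# ("push", mode) models pushMode(mode); ("pop",) models popMode.
RULES = {
    "DEFAULT_MODE": [
        ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", None),
        ("DQUOTE", r'"', ("push", "IN_STRING")),
        ("LPAR", r"\(", ("push", "DEFAULT_MODE")),
        ("RPAR", r"\)", ("pop",)),
    ],
    "IN_STRING": [
        ("TEXT", r'[^\\"]+', None),
        ("BACKSLASH_PAREN", r"\\\(", ("push", "DEFAULT_MODE")),
        ("ESCAPE_SEQUENCE", r"\\[\s\S]", None),
        ("DQUOTE", r'"', ("pop",)),  # DQUOTE_IN_STRING retyped to DQUOTE
    ],
}


def tokenize(source):
    """Return (type, text) pairs, switching modes the way the grammar does."""
    mode, stack = "DEFAULT_MODE", []
    pos, tokens = 0, []
    while pos < len(source):
        best = None
        for token_type, pattern, command in RULES[mode]:
            m = re.compile(pattern).match(source, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and (best is None or m.end() > best[1].end()):
                best = (token_type, m, command)
        if best is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        token_type, m, command = best
        tokens.append((token_type, m.group()))
        pos = m.end()
        if command and command[0] == "push":
            stack.append(mode)  # remember where to return to
            mode = command[1]
        elif command:
            mode = stack.pop()  # raises IndexError if the stack is empty


    return tokens


if __name__ == "__main__":
    for token in tokenize('"a \\((x)) b"'):
        print(token)
```

For the input "a \((x)) b", the first ) pops back to the default mode that ( pushed, and only the second ) returns to IN_STRING, so the trailing " b" is lexed as TEXT again.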

This approach works, but has the downside that an unmatched ) will cause an empty stack exception rather than a normal syntax error because we're calling popMode on an empty stack.

To avoid this, we can add a member that tracks how deeply nested we are inside parentheses and doesn't pop the stack when the nesting level is 0 (i.e. if the stack is empty):

@members {
    int nesting = 0;
}

LPAR: '(' {
    nesting++;
    pushMode(DEFAULT_MODE);
};
RPAR: ')' {
    if (nesting > 0) {
        nesting--;
        popMode();
    }
};

mode IN_STRING;

BACKSLASH_PAREN: '\\(' {
    nesting++;
    pushMode(DEFAULT_MODE);
};

(The parts I left out are the same as in the previous version).

This works and produces normal syntax errors for unmatched )s. However, it contains actions and is thus no longer language-agnostic, which is only a problem if you plan to use the grammar from multiple languages (and depending on the language, you might even be lucky and the code might be valid in all of your targeted languages).
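As a rough illustration of the guarded pop, here is a toy Python model of the action-based lexer. It is a sketch under assumptions rather than what ANTLR generates: the class NestingLexerSketch and its method names are made up for this example, and only the nesting counter and the push/pop logic mirror the actions in the grammar.

```python
import re


class NestingLexerSketch:
    """Toy model of the lexer with the `nesting` member (not ANTLR output)."""

    # (token type, regex, name of the action method or None) for each mode
    RULES = {
        "DEFAULT_MODE": [
            ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*", None),
            ("DQUOTE", r'"', "enter_string"),
            ("LPAR", r"\(", "open_paren"),
            ("RPAR", r"\)", "close_paren"),
        ],
        "IN_STRING": [
            ("TEXT", r'[^\\"]+', None),
            ("BACKSLASH_PAREN", r"\\\(", "open_paren"),
            ("ESCAPE_SEQUENCE", r"\\[\s\S]", None),
            ("DQUOTE", r'"', "leave_string"),
        ],
    }

    def __init__(self):
        self.mode, self.stack, self.nesting = "DEFAULT_MODE", [], 0

    def _push(self, new_mode):
        self.stack.append(self.mode)
        self.mode = new_mode

    def enter_string(self):
        self._push("IN_STRING")

    def leave_string(self):
        self.mode = self.stack.pop()

    def open_paren(self):  # shared by LPAR and BACKSLASH_PAREN
        self.nesting += 1
        self._push("DEFAULT_MODE")

    def close_paren(self):  # RPAR: pop only if some paren is still open
        if self.nesting > 0:
            self.nesting -= 1
            self.mode = self.stack.pop()

    def token_types(self, source):
        pos, types = 0, []
        while pos < len(source):
            best = None
            for token_type, pattern, action in self.RULES[self.mode]:
                m = re.compile(pattern).match(source, pos)
                if m and (best is None or m.end() > best[1].end()):
                    best = (token_type, m, action)
            if best is None:
                raise SyntaxError(f"no rule matches at offset {pos}")
            token_type, m, action = best
            types.append(token_type)
            pos = m.end()
            if action:
                getattr(self, action)()
        return types
```

With this model, an unmatched ) such as in a) simply yields an RPAR token, which the parser can then reject as an ordinary syntax error, while nested input like "\((a))" still returns to the string mode at the right ).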

If you want to avoid actions, the last alternative would be to have three modes: One for code that's outside of any strings, one for the inside of the string and one for the inside of \(). The third one will be almost identical to the outer one, except that it will push and pop the mode when seeing parentheses, whereas the outer one will not. To make both modes produce the same types of tokens, the rules in the third mode will all call type(). This will look like this:

lexer grammar StringLexer;

IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(';
RPAR: ')';

mode IN_STRING;

TEXT: ~[\\"]+ ;

BACKSLASH_PAREN: '\\(' -> pushMode(EMBEDDED);

ESCAPE_SEQUENCE: '\\' . ;

DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;

mode EMBEDDED;

E_IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* -> type(IDENTIFIER);
E_DQUOTE: '"' -> pushMode(IN_STRING), type(DQUOTE);
E_LPAR: '(' -> type(LPAR), pushMode(EMBEDDED);
E_RPAR: ')' -> type(RPAR), popMode;

Note that we now can no longer use string literals in the parser grammar because string literals can't be used when multiple lexer rules are defined using the same string literal. So now we have to use LPAR instead of '(' in the parser and so on (we already had to do this for DQUOTE for the same reason).
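To check that the three-mode lexer emits the same token types as the other variants, the modes can be modelled in a few lines of Python. This is a hedged sketch, not ANTLR itself: the MODES table and the token_types function are invented for illustration, the second tuple field plays the role of the type() command, and the string commands "push:..." and "pop" stand in for pushMode and popMode.

```python
import re

IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"

# (emitted token type, regex, command) per mode. Rules in EMBEDDED emit the
# same types as DEFAULT_MODE, which is what type(...) achieves in the grammar.
MODES = {
    "DEFAULT_MODE": [
        ("IDENTIFIER", IDENT, None),
        ("DQUOTE", r'"', "push:IN_STRING"),
        ("LPAR", r"\(", None),  # plain parentheses never touch the stack
        ("RPAR", r"\)", None),
    ],
    "IN_STRING": [
        ("TEXT", r'[^\\"]+', None),
        ("BACKSLASH_PAREN", r"\\\(", "push:EMBEDDED"),
        ("ESCAPE_SEQUENCE", r"\\[\s\S]", None),
        ("DQUOTE", r'"', "pop"),
    ],
    "EMBEDDED": [
        ("IDENTIFIER", IDENT, None),
        ("DQUOTE", r'"', "push:IN_STRING"),
        ("LPAR", r"\(", "push:EMBEDDED"),
        ("RPAR", r"\)", "pop"),
    ],
}


def token_types(source):
    """Return the token types the three-mode model emits for `source`."""
    mode, stack, pos, types = "DEFAULT_MODE", [], 0, []
    while pos < len(source):
        best = None
        for token_type, pattern, command in MODES[mode]:
            m = re.compile(pattern).match(source, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (token_type, m, command)
        if best is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        token_type, m, command = best
        types.append(token_type)
        pos = m.end()
        if command == "pop":
            mode = stack.pop()
        elif command:
            stack.append(mode)
            mode = command.split(":", 1)[1]
    return types
```

Because ) only pops inside EMBEDDED, where a matching \( or ( has always pushed a mode first, a stray ) at the top level is just an RPAR token and never reaches an empty stack.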

Since this version involves a lot of duplication (especially as the number of tokens rises) and prevents the use of string literals in the parser grammar, I generally prefer the version with the actions.

The full code for all three alternatives can also be found on GitHub.
