Antlr4 DM string lexer rules

Question

I'm trying to represent the BYOND DM language strings in lexer form (See http://byond.com and http://byond.com/docs/ref). Here are the rules for strings:

A string starts and ends with double quotes, e.g. "hello world" evaluates to hello world
A backslash acts as an escape character, which can escape the end quote, e.g. "hello\"world" evaluates to hello"world
Newlines in the string can be ignored by ending the line with a backslash, e.g. "hello\ followed by a line break and then world" evaluates to helloworld
If the string opens and closes with the sequences {" and "} respectively, newlines are allowed and are included in the final string. A backslash followed by a newline is still ignored
The string can contain embedded expressions inside square brackets, which are formatted into the result. A backslash can escape the opening bracket, e.g. "hello [ "world" ] \[" evaluates to hello world [ at run-time. Any expression can go in the brackets (calls, math, etc.)
If the starting quote or curly brace is prefixed with '@', escape sequences and embedded expressions are disabled for the string, e.g. @{"hello [worl\d"} and @"hello [worl\d" both evaluate to hello [worl\d
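
To make the rules above concrete, here is a minimal Python sketch of how a literal without embedded expressions could be evaluated. The function name decode_dm_string is hypothetical, and the sketch assumes that only the quote, opening-bracket, backslash and line-continuation escapes matter; it is not taken from the BYOND reference.

```python
def decode_dm_string(literal: str) -> str:
    """Evaluate a DM-style string literal that has no embedded expressions.

    Assumption: only backslash + quote, backslash + '[', a doubled backslash
    and backslash + newline are interpreted; other escapes are kept as written.
    """
    verbatim = literal.startswith("@")
    if verbatim:
        literal = literal[1:]

    # Strip the delimiters: {" ... "} for multi-line strings, " ... " otherwise.
    if len(literal) >= 4 and literal.startswith('{"') and literal.endswith('"}'):
        body = literal[2:-2]
    elif len(literal) >= 2 and literal.startswith('"') and literal.endswith('"'):
        body = literal[1:-1]
    else:
        raise ValueError("not a DM string literal")

    if verbatim:
        return body  # '@' disables escape processing entirely

    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt == "\n":
                pass  # line continuation: drop the backslash and the newline
            elif nxt in '"[\\':
                out.append(nxt)  # escaped quote, bracket or backslash
            else:
                out.append(ch + nxt)  # unknown escape, left untouched
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(decode_dm_string('"hello\\"world"'))
print(decode_dm_string('@"hello [worl\\d"'))
```

Embedded expressions are deliberately left out: evaluating them needs the tokenization discussed below, not just character-level decoding.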

I am trying to construct ANTLR4 .g4 lexer rules to tokenize these strings. I figure there are 4 (or more) token types I'd need:

Normal string, e.g. "hello world", @"hello world", @{"hello world"} or {"hello world"}
String start before an embedded expression, e.g. "hello [ or {"hello [
String end after an embedded expression, e.g. ] world" or ] world"}
String in between two embedded expressions, e.g. ] hello world [

Here are my (incomplete and unsuccessful) attempts:

LSTRING: '"' ('\\[' | ~[[\r\n])* '[';
RSTRING: ']' ('\\"' | ~["\r\n])* '"'; 
CSTRING: ']' ('\\[' | ~[[\r\n])* '['; 
FSTRING: '"' ('\\"' | ~["\r\n])* '"';
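
One likely reason these rules fall short: an ANTLR lexer prefers the rule that matches the most input, and FSTRING does not exclude '[', so it can swallow an embedded expression whole. The Python sketch below illustrates this with hand-translated regular expressions. The translation is an approximation (ANTLR does not use a backtracking regex engine), and longest_rule is a hypothetical helper.

```python
import re

# Approximate regex translations of two of the attempted lexer rules.
LSTRING = re.compile(r'"(?:\\\[|[^\[\r\n])*\[')
FSTRING = re.compile(r'"(?:\\"|[^"\r\n])*"')


def longest_rule(text: str) -> str:
    """Return the name of the rule that matches the most characters at offset 0."""
    candidates = {"LSTRING": LSTRING, "FSTRING": FSTRING}
    lengths = {}
    for name, pattern in candidates.items():
        match = pattern.match(text)
        lengths[name] = match.end() if match else 0
    return max(lengths, key=lengths.get)


sample = '"hello [x] world"'
# FSTRING covers the whole literal, so the '[x]' part never becomes its own tokens.
print(longest_rule(sample))
```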

If this can't be solved in the lexer, I can write the parser rules on my own with the tokens @, {", "}, [, ], \\, and ". But I figured I'd give this a shot, since handling it in the lexer should be more performant.

Cyberboss · Accepted Answer · 2018-12-11T19:21:56

I solved it with the following lexer tidbits.

...
@lexer::members
{
ulong regularAccessLevel;
System.Collections.Generic.Stack<bool> multiString = new System.Collections.Generic.Stack<bool>();
}
...
VERBATIUM_STRING: '@"' (~["\r\n])* '"';
MULTILINE_VERBATIUM_STRING: '@{"' (~'"')* '"}';
MULTI_STRING_START: '{"' { multiString.Push(true); } -> pushMode(INTERPOLATION_STRING);
STRING_START: '"' { multiString.Push(false); } -> pushMode(INTERPOLATION_STRING);
...
LBRACE: '[' { ++regularAccessLevel; };
RBRACE: ']' { if(regularAccessLevel > 0) --regularAccessLevel; else if(multiString.Count > 0) { PopMode(); } };
...
mode INTERPOLATION_STRING;
CHAR_INSIDE: '\\\''
    | '\\"'
    | '\\['
    | '\\\\'
    | '\\0'
    | '\\a'
    | '\\b'
    | '\\f'
    | '\\n'
    | '\\r'
    | '\\t'
    | '\\v'
    ;

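// '[' inside a string starts an embedded expression, which is lexed in the default mode.
EMBED_START: '[' -> pushMode(DEFAULT_MODE);
// The predicates select rules by string kind: the top of multiString is true for {"..."} strings.
MULTI_STRING_CLOSE: {multiString.Peek()}? '"}' { multiString.Pop(); PopMode(); };
STRING_CLOSE: {!multiString.Peek()}? '"' { multiString.Pop(); PopMode(); };
// Only the multi-line form may contain raw line breaks.
STRING_INSIDE: {!multiString.Peek()}? ~('[' | '\\' | '"' | '\r' | '\n')+;
MULTI_STRING_INSIDE: {multiString.Peek()}? ~('[' | '\\' | '"')+;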

Certain strings can cause it to emit multiple STRING_INSIDE/MULTI_STRING_INSIDE tokens in sequence, but this is acceptable since the parser will eat it all anyway.
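
For readers who want to trace the mode switching by hand, here is a rough Python sketch of the same idea: a mode stack, a stack recording the string kind, and a counter for brackets opened in ordinary code. It is a simplification under stated assumptions, not a port of the grammar: verbatim '@' strings are omitted, the OTHER token is a hypothetical catch-all for everything else in DM, and the single global bracket counter mirrors regularAccessLevel above.

```python
# Characters accepted after a backslash, taken from the CHAR_INSIDE rule.
ESCAPABLE = set("'\"[\\0abfnrtv")


def tokenize(src):
    """Split src into (kind, text) pairs using a stack of lexer modes."""
    tokens = []
    modes = ["DEFAULT"]  # stands in for ANTLR's mode stack
    multi = []           # multiString: True while inside a {"..."} string
    depth = 0            # regularAccessLevel: '[' opened in code, not yet closed
    i = 0

    def emit(kind, length):
        nonlocal i
        tokens.append((kind, src[i:i + length]))
        i += length

    while i < len(src):
        ch = src[i]
        if modes[-1] == "DEFAULT":
            if src.startswith('{"', i):
                multi.append(True)
                modes.append("STRING")
                emit("MULTI_STRING_START", 2)
            elif ch == '"':
                multi.append(False)
                modes.append("STRING")
                emit("STRING_START", 1)
            elif ch == "[":
                depth += 1
                emit("LBRACE", 1)
            elif ch == "]":
                if depth > 0:
                    depth -= 1
                elif multi:
                    modes.pop()  # closes an embedded expression, back to the string
                emit("RBRACE", 1)
            else:
                emit("OTHER", 1)  # hypothetical catch-all token
        else:
            if ch == "\\":
                if src[i + 1:i + 2] not in ESCAPABLE:
                    raise ValueError(f"unsupported escape at offset {i}")
                emit("CHAR_INSIDE", 2)
            elif ch == "[":
                modes.append("DEFAULT")
                emit("EMBED_START", 1)
            elif multi[-1] and src.startswith('"}', i):
                multi.pop()
                modes.pop()
                emit("MULTI_STRING_CLOSE", 2)
            elif not multi[-1] and ch == '"':
                multi.pop()
                modes.pop()
                emit("STRING_CLOSE", 1)
            else:
                # Consume a run of ordinary characters up to the next special one.
                stop = '[\\"' if multi[-1] else '[\\"\r\n'
                j = i
                while j < len(src) and src[j] not in stop:
                    j += 1
                if j == i:
                    raise ValueError(f"unexpected character at offset {i}")
                emit("MULTI_STRING_INSIDE" if multi[-1] else "STRING_INSIDE", j - i)
    return tokens


for kind, text in tokenize('"hello [ "world" ] \\["'):
    print(kind, repr(text))
```

The closing ']' of an embedded expression comes out as an ordinary RBRACE token, as in the grammar, so it is left to the parser to pair EMBED_START with the matching RBRACE.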

A lot of it came from reading the C# interpolated string handling in the ANTLR4 examples.
