I'm trying to represent the BYOND DM language strings in lexer form (See http://byond.com and http://byond.com/docs/ref). Here are the rules for strings:

  • A string starts and ends with double quotes, e.g. "hello world" evaluates to hello world
  • A backslash acts as an escape character and can escape the closing quote, e.g. "hello\"world" evaluates to hello"world
  • A newline inside the string is ignored if the line ends with a backslash, e.g. "hello\<newline>world" evaluates to helloworld
  • If the string opens and closes with the sequences {" and "} respectively, newlines are allowed and kept in the final string. A backslash followed by a newline is still ignored
  • The string can contain embedded expressions inside square brackets, which are formatted into the result. A backslash can escape the opening bracket, e.g. "hello [ "world" ] \[" evaluates to hello world [ at run time. Any expression can go in the brackets (calls, math, etc.)
  • If the opening quote or {" is prefixed with '@', escape sequences and embedded expressions are disabled for the string, e.g. @{"hello [worl\d"} and @"hello [worl\d" both evaluate to hello [worl\d
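
To make the rules above concrete, here is a minimal Python sketch that decodes a DM string literal containing no embedded expressions. The function name decode_dm_string and the SIMPLE_ESCAPES table are hypothetical, and the meanings given to \n and \t are an assumption based on common C-style escapes, not something taken from the DM reference. It does not check that a plain "..." string is free of raw newlines.

```python
# Hypothetical escape table: only \" and \[ come from the rules above;
# \\, \n and \t are assumed to behave like their C-style counterparts.
SIMPLE_ESCAPES = {'"': '"', "[": "[", "\\": "\\", "n": "\n", "t": "\t"}


def decode_dm_string(literal: str) -> str:
    """Decode a DM string literal that has no embedded [expressions]."""
    verbatim = literal.startswith("@")
    if verbatim:
        literal = literal[1:]

    # Strip the delimiters: {"..."} for multi-line strings, "..." otherwise.
    if len(literal) >= 4 and literal.startswith('{"') and literal.endswith('"}'):
        body = literal[2:-2]
    elif len(literal) >= 2 and literal.startswith('"') and literal.endswith('"'):
        body = literal[1:-1]
    else:
        raise ValueError("not a DM string literal")

    if verbatim:
        return body  # '@' disables escapes, so the body is taken as-is

    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt != "\n":  # backslash-newline is a continuation: emit nothing
                out.append(SIMPLE_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)  # raw newlines inside {"..."} are kept here
            i += 1
    return "".join(out)
```

For example, decode_dm_string('@"hello [worl\\d"') returns the body unchanged, while decode_dm_string('"hello\\"world"') drops the backslash and keeps the quote.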

I am trying to construct ANTLR4 .g4 lexer rules to tokenize these strings. I figure I'd need four (or more) token types:

  • Normal string, e.g. "hello world", @"hello world", @{"hello world"} or {"hello world"}
  • String start before an embedded expression, e.g. "hello [ or {"hello [
  • String end after an embedded expression, e.g. ] world" or ] world"}
  • String between two embedded expressions, e.g. ] hello world [

Here are my (incomplete and unsuccessful) attempts:

LSTRING: '"' ('\\[' | ~[[\r\n])* '[';
RSTRING: ']' ('\\"' | ~["\r\n])* '"'; 
CSTRING: ']' ('\\[' | ~[[\r\n])* '['; 
FSTRING: '"' ('\\"' | ~["\r\n])* '"';

If this can't be solved in the lexer, I can write the parser rules on my own with the tokens @, {", "}, [, ], \\, and ". But, I figure I'd give this a shot since it'd be more performant.
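
One likely reason the four rules above misfire is that a context-free rule starting with ']' cannot tell an embedded-expression close from an ordinary index close, and an ANTLR lexer prefers the longest match. The sketch below translates the RSTRING attempt into a Python regular expression (the translation and the helper name rstring_match_at are mine) and applies it to ordinary code that contains no embedded expression at all.

```python
import re

# Python translation of the RSTRING attempt: ']' ('\\"' | ~["\r\n])* '"'
RSTRING = re.compile(r'\](?:\\"|[^"\r\n])*"')


def rstring_match_at(src: str, pos: int):
    """Return the text RSTRING would match starting at pos, or None."""
    m = RSTRING.match(src, pos)
    return m.group(0) if m else None


# Ordinary indexing followed by a plain string, no embedded expression.
code = 'x = a[1] + "b"'
swallowed = rstring_match_at(code, code.index("]"))
```

Here swallowed is '] + "': the rule consumes the index close, the operator, and the opening quote of the next string. Since that is longer than a one-character ']' token, a longest-match lexer would pick it, which is why some state (modes, or the parser) is needed to know whether the lexer is currently inside a string.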

1 Answer

I solved it with the following lexer rules (C# target):

...
@lexer::members
{
ulong regularAccessLevel;
System.Collections.Generic.Stack<bool> multiString = new System.Collections.Generic.Stack<bool>();
}
...
VERBATIUM_STRING: '@"' (~["\r\n])* '"';
MULTILINE_VERBATIUM_STRING: '@{"' (~'"')* '"}';
MULTI_STRING_START: '{"' { multiString.Push(true); } -> pushMode(INTERPOLATION_STRING);
STRING_START: '"' { multiString.Push(false); } -> pushMode(INTERPOLATION_STRING);
...
LBRACE: '[' { ++regularAccessLevel; };
RBRACE: ']' { if(regularAccessLevel > 0) --regularAccessLevel; else if(multiString.Count > 0) { PopMode(); } };
...
mode INTERPOLATION_STRING;
CHAR_INSIDE: '\\\''
    | '\\"'
    | '\\['
    | '\\\\'
    | '\\0'
    | '\\a'
    | '\\b'
    | '\\f'
    | '\\n'
    | '\\r'
    | '\\t'
    | '\\v'
    ;

EMBED_START: '[' -> pushMode(DEFAULT_MODE);
MULTI_STRING_CLOSE: {multiString.Peek()}? '"}' { multiString.Pop(); PopMode(); };
STRING_CLOSE: {!multiString.Peek()}? '"' { multiString.Pop(); PopMode(); };
STRING_INSIDE: {!multiString.Peek()}? ~('[' | '\\' | '"' | '\r' | '\n')+;
MULTI_STRING_INSIDE: {multiString.Peek()}? ~('[' | '\\' | '"')+;

Certain strings can cause it to emit multiple STRING_INSIDE/MULTI_STRING_INSIDE tokens in sequence, but this is acceptable since the parser will eat it all anyway.
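
The mode-stack idea in the grammar above can be traced in plain Python. This is a rough simulation, not the ANTLR-generated lexer: the function name tokenize and the catch-all CODE token are hypothetical, and unlike the grammar it keeps one bracket counter per embedded expression instead of a single regularAccessLevel. It also consumes a lone quote inside {"..."} as text, a case the grammar has no rule for.

```python
def tokenize(src: str) -> list:
    """Split src into (token_type, text) pairs using a mode stack."""
    tokens = []
    modes = ["DEFAULT"]  # plays the role of pushMode/PopMode
    multi = []           # True for {"..."} strings, False for "..."
    depth = [0]          # open '[' count, one entry per DEFAULT level
    i = 0
    while i < len(src):
        if modes[-1] == "DEFAULT":
            if src.startswith('{"', i):
                tokens.append(("MULTI_STRING_START", '{"'))
                multi.append(True)
                modes.append("STRING")
                i += 2
            elif src[i] == '"':
                tokens.append(("STRING_START", '"'))
                multi.append(False)
                modes.append("STRING")
                i += 1
            elif src[i] == "[":
                tokens.append(("LBRACE", "["))
                depth[-1] += 1
                i += 1
            elif src[i] == "]":
                tokens.append(("RBRACE", "]"))
                if depth[-1] > 0:
                    depth[-1] -= 1      # closes an ordinary index bracket
                elif len(modes) > 1:
                    depth.pop()         # closes the embedded expression
                    modes.pop()
                i += 1
            else:
                # Everything else is lumped into one stand-in CODE token.
                j = i
                while (j < len(src) and src[j] not in '"[]'
                       and not src.startswith('{"', j)):
                    j += 1
                tokens.append(("CODE", src[i:j]))
                i = j
        else:  # STRING mode
            if src[i] == "\\" and i + 1 < len(src):
                tokens.append(("CHAR_INSIDE", src[i:i + 2]))
                i += 2
            elif src[i] == "[":
                tokens.append(("EMBED_START", "["))
                modes.append("DEFAULT")
                depth.append(0)
                i += 1
            elif multi[-1] and src.startswith('"}', i):
                tokens.append(("MULTI_STRING_CLOSE", '"}'))
                multi.pop()
                modes.pop()
                i += 2
            elif not multi[-1] and src[i] == '"':
                tokens.append(("STRING_CLOSE", '"'))
                multi.pop()
                modes.pop()
                i += 1
            else:
                stop = '[\\"' if multi[-1] else '[\\"\r\n'
                j = i
                while j < len(src) and src[j] not in stop:
                    j += 1
                if j == i:
                    j = i + 1  # no rule applies: consume one char as text
                name = "MULTI_STRING_INSIDE" if multi[-1] else "STRING_INSIDE"
                tokens.append((name, src[i:j]))
                i = j
    return tokens
```

Running it on "hello [ "world" ] \[" shows the inner "world" being lexed as its own complete string between EMBED_START and the closing RBRACE, and the escape \[ arriving as a separate CHAR_INSIDE token next to a STRING_INSIDE run. The per-embed counter is a deliberate deviation: as far as I can tell, with a single counter an input such as a["x[b]"] would decrement the counter at the inner ']' instead of popping the mode.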

A lot of it came from reading the C# interpolated string handling in the ANTLR4 example grammars.