I'm trying to write lexer rules for strings in the BYOND DM language (see http://byond.com and http://byond.com/docs/ref). Here are the rules for strings:
- A string starts and ends with double quotes, e.g. "hello world" evaluates to hello world
- A backslash acts as an escape character, which can escape the end quote, e.g. "hello\"world" evaluates to hello"world
- Newlines in the string can be ignored by ending the line with a backslash, e.g.
"hello\
world"
evaluates to helloworld
- If the string opens and closes with the sequences {" and "} respectively, newlines are allowed and entered into the final string. The sequence \\\n (a backslash followed by a newline) is still ignored.
- The string can contain embedded expressions inside square brackets, which are formatted into the result. Backslashes can escape the opening bracket, e.g. "hello [ "world" ] \[" evaluates to hello world [ at run-time. Any expression can go in the brackets (calls, math, etc.).
- If the starting quote or curly brace is prefixed with '@', escape sequences and embedded expressions are disabled for the string, e.g.
@{"hello [worl\d"} and @"hello [worl\d" both evaluate to hello [worl\d
I am trying to construct ANTLR4 .g4 lexer rules to tokenize these strings. I figure there are four (or more) token types I'd need:
- Normal string, e.g. "hello world", @"hello world", @{"hello world"} or {"hello world"}
- String start before an embedded expression, e.g. "hello [ or {"hello [
- String end after an embedded expression, e.g. ] world" or ] world"}
- String in between two embedded expressions, e.g. ] hello world [
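To make the intended split concrete, here is a rough Python sketch (not ANTLR) of where the four token types would begin and end. It is only an illustration under several assumptions: it covers the plain double-quoted form only (no @ prefix, no {" "} form), it assumes brackets inside an embedded expression are balanced, and the function name scan_string and the EXPR pseudo-token are made up for this example.

```python
def scan_string(src, i=0):
    """Split one plain double-quoted DM string starting at src[i].

    Returns (tokens, next_index). Each token is a (kind, text) pair, where
    kind is FSTRING, LSTRING, CSTRING, RSTRING, or the hypothetical EXPR
    (the raw text of an embedded expression, left untokenized here).
    """
    if src[i] != '"':
        raise ValueError("expected opening double quote")
    tokens = []
    start = i        # where the current string piece begins
    opened_by = '"'  # the piece was opened by '"' or by a closing ']'
    i += 1
    while i < len(src):
        c = src[i]
        if c == '\\':
            i += 2   # skip the backslash and the character it escapes
            continue
        if c == '"':
            kind = 'FSTRING' if opened_by == '"' else 'RSTRING'
            tokens.append((kind, src[start:i + 1]))
            return tokens, i + 1
        if c == '[':
            kind = 'LSTRING' if opened_by == '"' else 'CSTRING'
            tokens.append((kind, src[start:i + 1]))
            # Walk the embedded expression up to its matching ']'.
            j = i + 1
            expr_start = j
            depth = 1
            while j < len(src):
                if src[j] == '"':
                    # Nested string literal: skip it recursively.
                    _, j = scan_string(src, j)
                    continue
                if src[j] == '[':
                    depth += 1
                elif src[j] == ']':
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            else:
                raise ValueError("unterminated embedded expression")
            tokens.append(('EXPR', src[expr_start:j]))
            start = j        # the next piece starts at the closing ']'
            opened_by = ']'
            i = j + 1
            continue
        i += 1
    raise ValueError("unterminated string")


if __name__ == "__main__":
    for sample in ['"hello world"', r'"hello [ "world" ] \["', '"a [x] b [y] c"']:
        print(sample, "->", scan_string(sample)[0])
```

The point of the sketch is that the scanner has to remember whether it is inside an embedded expression (and how deeply nested), which is the state a context-free set of lexer rules like my attempts does not track.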
Here are my (incomplete and unsuccessful) attempts:
LSTRING: '"' ('\\[' | ~[[\r\n])* '[';
RSTRING: ']' ('\\"' | ~["\r\n])* '"';
CSTRING: ']' ('\\[' | ~[[\r\n])* '[';
FSTRING: '"' ('\\"' | ~["\r\n])* '"';
If this can't be solved in the lexer, I can write the parser rules on my own with the tokens @, {", "}, [, ], \\, and ". But I figure I'd give this a shot since it'd be more performant.