Those are called escape sequences. You can consult the C# grammar to see which characters need to be escaped in a string literal. Basically, you can escape any character with its Unicode character escape sequence: \u followed by four hex digits (\u hex_digit hex_digit hex_digit hex_digit). For example, replace U+000D with \u000d for a carriage return character.
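As a quick check of that equivalence, here is a minimal sketch (the class name UnicodeEscapeDemo is only illustrative) showing that a Unicode escape and the corresponding short escape denote the same character:

```csharp
using System;

public static class UnicodeEscapeDemo
{
    public static void Main()
    {
        // Both literals denote the same one-character string (U+000D).
        string viaUnicodeEscape = "\u000d";
        string viaSimpleEscape = "\r";
        Console.WriteLine(viaUnicodeEscape == viaSimpleEscape); // True
        Console.WriteLine((int)viaUnicodeEscape[0]);            // 13

        // Ordinary characters can be written the same way.
        Console.WriteLine("\u0041\u0042\u0043" == "ABC");       // True
    }
}
```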
If you want to keep the string short, though, most characters don't need escaping. The ones that do need to be escaped are:
- Double quote character " (U+0022)
- Backslash character \ (U+005C)
- Carriage return character (U+000D)
- Line feed character (U+000A)
- Next line character (U+0085)
- Line separator character (U+2028)
- Paragraph separator character (U+2029)
Everything else can be inserted literally.
If you furthermore want to restrict your source file to ASCII, then you can be even more restrictive about which characters to represent literally, possibly very restrictive.
Make yourself a function that decides whether a character should be escaped or not. You might want to start with a function like:
    public static bool IsSafeForLiteral(char ch) =>
        ch < 127
        && ch != '\u0022' // double quote
        && ch != '\u005c' // backslash
        && ch != '\u000d' // carriage return
        && ch != '\u000a' // line feed
        && (
            Char.IsLetterOrDigit(ch)
            || Char.IsPunctuation(ch)
            || Char.IsSymbol(ch)
            || (ch == ' ')
        );
Then use this test to construct a function that turns a string into C# source code for a string literal.
    // Requires: using System.Text;
    public static string ToSourceStringLiteral(string str)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("\""); // opening quote
        foreach (char c in str) {
            if (IsSafeForLiteral(c)) {
                sb.Append(c);
            } else {
                // Everything else becomes a four-digit Unicode escape.
                sb.AppendFormat(@"\u{0:X4}", (int)c);
            }
        }
        sb.Append("\""); // closing quote
        return sb.ToString();
    }
If you're really attached to the idea of a carriage return coming out as \r instead of \u000d, then you'll furthermore have to program all those escape sequences. One way is to make a dictionary of characters to replacements and apply that as well.
    // Requires: using System.Collections.Generic;
    public static readonly Dictionary<char, string> CSharpSpecialEscapes = new Dictionary<char, string>() {
        { '\u0000', @"\0" },
        { '\u0007', @"\a" },
        { '\u0008', @"\b" },
        { '\u0009', @"\t" },
        { '\u000a', @"\n" },
        { '\u000b', @"\v" },
        { '\u000c', @"\f" },
        { '\u000d', @"\r" },
        { '\u001b', @"\e" }, // \e is only recognized by C# 13 and later; drop this entry for older compilers
        { '\u005c', @"\\" }
    };
    // Extension method: declare it in a top-level static class.
    public static string ToSourceStringLiteral(this string str)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("\"");
        foreach (char c in str) {
            if (CSharpSpecialEscapes.TryGetValue(c, out string replacement)) {
                sb.Append(replacement);
            } else if (IsSafeForLiteral(c)) {
                sb.Append(c);
            } else {
                sb.AppendFormat(@"\u{0:X4}", (int)c);
            }
        }
        sb.Append("\"");
        return sb.ToString();
    }
Depending on performance requirements, you could also prepopulate an array with all replacements in the 0..127 range and just use that, although the source starts to look less maintainable at that point. I recommend what I wrote above because it is descriptive: it mirrors how the string escape sequences are defined, rather than being optimally efficient.
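For comparison, the array-based variant could look roughly like the sketch below. This is not part of the code above: the class name AsciiEscapeTable and the choice of which short escapes to include are assumptions made for illustration. In particular, this sketch writes the double quote as \" rather than \u0022.

```csharp
using System;
using System.Text;

public static class AsciiEscapeTable
{
    // Table[i] holds the source text to emit for the character with code i.
    private static readonly string[] Table = BuildTable();

    private static string[] BuildTable()
    {
        var table = new string[128];
        for (int i = 0; i < table.Length; i++)
        {
            // Printable ASCII (space through tilde) is emitted literally,
            // everything else as a four-digit Unicode escape.
            bool printable = i >= 0x20 && i < 0x7f;
            table[i] = printable ? ((char)i).ToString() : "\\u" + i.ToString("X4");
        }
        // Short escapes override the defaults (an illustrative subset).
        table['"'] = "\\\"";
        table['\\'] = "\\\\";
        table['\0'] = "\\0";
        table['\t'] = "\\t";
        table['\n'] = "\\n";
        table['\r'] = "\\r";
        return table;
    }

    public static string ToSourceStringLiteral(string str)
    {
        var sb = new StringBuilder(str.Length + 2);
        sb.Append('"');
        foreach (char c in str)
        {
            if (c < Table.Length)
                sb.Append(Table[c]);                              // one array lookup per ASCII char
            else
                sb.Append("\\u").Append(((int)c).ToString("X4")); // non-ASCII: always escaped
        }
        sb.Append('"');
        return sb.ToString();
    }

    public static void Main()
    {
        // Input: a, tab, b, double quote, c, e-acute (U+00E9)
        Console.WriteLine(ToSourceStringLiteral("a\tb\"c\u00e9")); // prints "a\tb\"c\u00E9"
    }
}
```

The table is built once in a static initializer, so the per-character work is a bounds check and an array read instead of a dictionary lookup followed by several Char.Is... calls.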
I also made this version add the quotes at the beginning and end. You can easily remove the lines that say sb.Append("\""); if you don't want them.