How to represent special characters using multiple ASCII characters

Question

I am trying to represent special characters such as CR, LF, and NUL with their respective multi-character ASCII representations (\r, \n, \0).

Basically, I want to write a string variable containing these special characters to an ASCII log file in such a way that I can copy the text from the file, paste it into Visual Studio, and get back the same string that was written.

I imagine the best way to do this would be to write special characters in the same format the Visual Studio code editor uses. (Please tell me what this string format is called.)

Sample code:

string mystring = "\r\n\0\0\u0001\u0018\0\0\u0001\u000fXML";
Console.WriteLine(mystring);

So I want to convert mystring so that Console.WriteLine would output \r\n\0\0\u0001\u0018\0\0\u0001\u000fXML instead of the raw control characters.

The console is just an easy way to describe the problem. I will be printing my string in different ways, so I need to convert mystring to a string that would print \r\n\0\0\u0001\u0018\0\0\u0001\u000fXML (and all other special characters in the same way).

"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" — Uwe Keim
What exactly is it that you want to achieve? This sounds like an XY problem to me. — Ackdari
@Wyck The solution posted gets the job done. The only complaint I have is that it splits long strings with " + " — Laov

Wyck · Accepted Answer · 2020-05-12T13:55:23

Those are called escape sequences. You can consult the grammar to see which characters need to be escaped in a string literal. Basically you can escape any character with its Unicode character escape sequence.

\u hex_digit hex_digit hex_digit hex_digit

e.g.: replace U+000D with \u000d for a carriage return character.
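As a minimal sketch of this "escape everything" approach, the following program writes every UTF-16 code unit of a string as a \uXXXX sequence. The names EscapeAllDemo and EscapeAll are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text;

public static class EscapeAllDemo
{
    // Hypothetical helper: writes every UTF-16 code unit as \uXXXX (lower-case hex).
    public static string EscapeAll(string s)
    {
        var sb = new StringBuilder(s.Length * 6);
        foreach (char c in s)
        {
            sb.Append(@"\u").Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EscapeAll("\r\nA")); // prints \u000d\u000a\u0041
    }
}
```

The result is always valid inside a string literal, but it is six characters per input character, which is why the shorter forms discussed next are usually preferable for log files.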

If you want to keep the string short though, then there are some that don't need escaping. The ones that do need to be escaped are:

" (U+0022)
\ (U+005C)
Carriage return character (U+000D)
Line feed character (U+000A)
Next line character (U+0085)
Line separator character (U+2028)
Paragraph separator character (U+2029)

Everything else can be inserted literally.

If you furthermore want your source file to contain only ASCII, then you can be even more restrictive about which characters to represent literally. You might want to be very restrictive.

Make yourself a function that decides if a character should be escaped or not. You might want to start with a function like:

public static bool IsSafeForLiteral(char ch) =>
    ch < 127
    && ch != '\u0022' // double quote
    && ch != '\u005c' // backslash
    && ch != '\u000d' // carriage return
    && ch != '\u000a' // line feed
    && (
        Char.IsLetterOrDigit(ch)
        || Char.IsPunctuation(ch)
        || Char.IsSymbol(ch)
        || (ch == ' ')
    );

Then use this test to construct a function that turns a string into C# source code for a string literal.

// Requires "using System.Text;" for StringBuilder.
public static string ToSourceStringLiteral(string str)
{
    StringBuilder sb = new StringBuilder();
    sb.Append("\"");
    foreach (char c in str) {
        if (IsSafeForLiteral(c)) {
            sb.Append(c);
        } else {
            sb.AppendFormat(@"\u{0:X4}", (int)c);
        }
    }
    sb.Append("\"");
    return sb.ToString();
}

If you're really attached to the idea of a carriage return coming out as \r instead of \u000d then you'll furthermore have to program all those escape sequences.

One way is to make a dictionary of characters to replacements and apply that as well.

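// Requires "using System.Collections.Generic;".
public static Dictionary<char, string> CSharpSpecialEscapes = new Dictionary<char, string>() {
    { '\u0000', @"\0" },
    { '\u0007', @"\a" },
    { '\u0008', @"\b" },
    { '\u0009', @"\t" },
    { '\u000a', @"\n" },
    { '\u000b', @"\v" },
    { '\u000c', @"\f" },
    { '\u000d', @"\r" },
    // \e is only accepted by C# 13 and later; remove this entry to fall back to \u001B.
    { '\u001b', @"\e" },
    { '\u0022', "\\\"" }, // double quote comes out as \" instead of \u0022
    { '\u005c', @"\\" }
};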

public static string ToSourceStringLiteral(this string str)
{
    StringBuilder sb = new StringBuilder();
    sb.Append("\"");
    foreach (char c in str) {
        if (CSharpSpecialEscapes.TryGetValue(c, out string replacement)) {
            sb.Append(replacement);
        } else if (IsSafeForLiteral(c)) {
            sb.Append(c);
        } else {
            sb.AppendFormat(@"\u{0:X4}", (int)c);
        }
    }
    sb.Append("\"");
    return sb.ToString();
}

Depending on performance requirements, you could also prepopulate an array with all replacements in the 0..127 range and just use that, although the source starts to look less maintainable at that point. I recommend what I wrote above because it is descriptive: it closely matches how the string escape sequences are defined, rather than being optimally efficient.
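For comparison, here is a rough sketch of the prepopulated-array idea. The names TableEscaper, Table, BuildTable and Escape are hypothetical, the \e escape is left out so the output compiles on C# versions before 13, and this variant does not add the surrounding quotes.

```csharp
using System;
using System.Text;

public static class TableEscaper
{
    // One precomputed replacement string per ASCII code 0..127.
    static readonly string[] Table = BuildTable();

    static string[] BuildTable()
    {
        var t = new string[128];
        for (int i = 0; i < 128; i++)
        {
            // Printable ASCII maps to itself, everything else to \uXXXX.
            t[i] = (i >= 32 && i < 127) ? ((char)i).ToString() : @"\u" + i.ToString("X4");
        }
        // Overwrite the entries that have a short escape sequence.
        t['\0'] = @"\0"; t['\a'] = @"\a"; t['\b'] = @"\b"; t['\t'] = @"\t";
        t['\n'] = @"\n"; t['\v'] = @"\v"; t['\f'] = @"\f"; t['\r'] = @"\r";
        t['\\'] = @"\\"; t['"'] = "\\\"";
        return t;
    }

    public static string Escape(string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (c < 128) sb.Append(Table[c]);            // single array lookup
            else sb.AppendFormat(@"\u{0:X4}", (int)c);   // non-ASCII fallback
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Escape("a\tb\u00e9")); // prints a\tb\u00E9
    }
}
```

The table trades a dictionary lookup plus a classification call for one array index per character, at the cost of hiding the escape rules inside the table-building code.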

Both versions of ToSourceStringLiteral add the quotes at the beginning and end. You can easily remove the lines that say sb.Append("\""); if you don't want them.
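To check the result against the string from the question, a small console program along these lines can be used. It is a condensed sketch of the dictionary-based approach, not a drop-in replacement: the class name LiteralDemo is hypothetical, the safety test is simplified to "printable ASCII", and \e is omitted because only C# 13 and later accept it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LiteralDemo
{
    // Short escapes accepted by every C# version.
    static readonly Dictionary<char, string> Escapes = new Dictionary<char, string>
    {
        { '\0', @"\0" }, { '\a', @"\a" }, { '\b', @"\b" }, { '\t', @"\t" },
        { '\n', @"\n" }, { '\v', @"\v" }, { '\f', @"\f" }, { '\r', @"\r" },
        { '\\', @"\\" }, { '"', "\\\"" }
    };

    public static string ToSourceStringLiteral(string str)
    {
        var sb = new StringBuilder("\"");
        foreach (char c in str)
        {
            if (Escapes.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else if (c >= ' ' && c < 127)   // printable ASCII is copied as-is
                sb.Append(c);
            else
                sb.AppendFormat(@"\u{0:X4}", (int)c);
        }
        return sb.Append('"').ToString();
    }

    public static void Main()
    {
        string mystring = "\r\n\0\0\u0001\u0018\0\0\u0001\u000fXML";
        Console.WriteLine(ToSourceStringLiteral(mystring));
        // prints "\r\n\0\0\u0001\u0018\0\0\u0001\u000FXML"
    }
}
```

The only visible difference from the original literal is the upper-case hex digit in \u000F, which comes from the X4 format specifier and denotes the same character.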
