I'm new to Perl, and I'm trying to print the folderName from Mork files (from Thunderbird).

From: https://github.com/KevinGoodsell/mork-converter/blob/master/doc/mork-format.txt

The second type of special character sequence is a dollar sign followed by two hexadecimal digits which give the value of the replacement byte. This is often used for bytes that are non-printable as ASCII characters, especially in UTF-16 text. For example, a string with the Unicode snowman character (U+2603):

☃snowman☃

may be represented as UTF-16 text in an Alias this way:

<(83=$03$26s$00n$00o$00w$00m$00a$00n$00$03$26)>

In all the Thunderbird files I've seen, though, the text is actually encoded in UTF-8 (2 to 4 bytes per non-ASCII character).

The following characters need to be escaped (with \) within the string to be used literally: $, ) and \

Example: aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\x08 should print aaa$AAñb☺í\x08

$C3$B1 is ñ; $E2$98$BA is ☺; $C3$AD is í
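As a sanity check on the byte groups above, here is a minimal sketch that decodes each group as UTF-8 with the core Encode module and prints the resulting code point. The helper name codepoint_of is hypothetical and used only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take a run of hex digits (e.g. 'C3B1'),
# turn it into raw bytes, decode those bytes as UTF-8 and
# return the code point of the first decoded character.
sub codepoint_of {
    my ($hex) = @_;
    my $bytes = pack 'H*', $hex;          # hex digits -> raw bytes
    my $char  = decode('UTF-8', $bytes);  # bytes -> character string
    return ord $char;
}

# The three byte groups from the example string.
for my $hex ('C3B1', 'E298BA', 'C3AD') {
    printf "%s -> U+%04X\n", $hex, codepoint_of($hex);
}
```

Each group of two or three bytes comes out as a single character, so the bytes have to be decoded together rather than one at a time.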

I tried using a regex to replace each unescaped $ with \x:

my $unescaped = qr/(?<!\\)(?:(\\\\)*)/;
$folder =~ s/$unescaped\$/\\x/g;
$folder =~ s/\\([\\$)])/$1/g;   # unescape \, $ and )

Within Perl, printing the result just shows the literal string, \x sequences included; the escapes are not interpreted.

My workaround is feeding it into bash's printf and it succeeds... unless there's a literal "\x" in the string

$ folder=$(printf "$(mork.pl 8777646a.msf)")
$ echo "$folder"
  aaa$AAñb☺í

Questions I consulted:

"Convert UTF-8 character sequence to real UTF-8 bytes": it seems to interpret every byte by itself, not in groups.

"In Perl, how can I convert an array of bytes to a Unicode string?": I don't know how to apply this solution to my use case.

Is there any way to achieve this in perl?

If you omitted the last unescape substitution, you could simply have used eval "\"$folder\"". But using eval is in general not safe, so it is better to use e.g. String::Escape. – Håkon Hægland

1 Answer

The following substitution seems to work for your input:

s/\\([\$\\])|\$(..)/$2 ? chr hex $2 : $1/ge;

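The first alternative captures an escaped \$ or \\ and replaces it with the bare $ or \. Otherwise, the second alternative captures the two characters after an unescaped $ and converts them to the corresponding byte. Note that \) is not handled as written; add ) to the character class if you need it.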

If you want to work with the result in Perl, don't forget to decode it from UTF-8.

use Encode qw(decode);
$chars = decode('UTF-8', $bytes);
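Putting the two steps together, here is a minimal sketch of the whole conversion. It is a variation on the substitution above, not the same regex: it unescapes any backslash-escaped character (so \) is covered too) and only treats $ followed by two hex digits as a byte. The name mork_unescape is hypothetical, and the input is the example string from the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a raw Mork value into a Perl character string.
sub mork_unescape {
    my ($raw) = @_;
    # First alternative: a backslash-escaped character, kept literally.
    # Second alternative: $XX, converted to the byte with that hex value.
    (my $bytes = $raw) =~
        s/\\(.)|\$([0-9A-Fa-f]{2})/defined $2 ? chr hex $2 : $1/ge;
    # The bytes are assumed to be UTF-8, as observed in the question.
    return decode('UTF-8', $bytes);
}

# Single quotes: \$ stays as-is, \\\\ yields two literal backslashes.
my $raw     = 'aaa\$AA$C3$B1b$E2$98$BA$C3$AD\\\\x08';
my $decoded = mork_unescape($raw);

# Encode on output so the wide characters print correctly.
binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```

Because escapes and hex bytes are handled in a single left-to-right pass, a literal \x in the data is left alone, which avoids the problem with the printf workaround.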