perl - fixing mixed utf8 and latin encoding: use open IO vs. binmode

Question

I have a perl storage file, which (when dumped with Dumper) has these strings in it:

my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";
my $str2 =  "2 = educa\x{e7}\x{e3}o";

I've been trying to work out a reasonable strategy, to output UTF8 (see also perl Encode::Guess with and without hints - detecting utf8).

Let me continue with the perl code above and get some declarations out the way:

use 5.18.2;
use Encode qw( encode_utf8 decode_utf8 from_to encode decode);
use Encode::Guess;
use Encoding::FixLatin qw(fix_latin);

# Print both strings with diagnostics; the second call passes a suspect list hint.
sub sayStrings {
    say fixEnc($_[0]);
    say fixEnc($_[1],'hint');
    say "";
}

# Append the guessed encoding and the utf8 flags to the string.
sub fixEnc {
    my $data = $_[0];
    my $enc = "";
    if ($_[1]) {
        $enc = guess_encoding($data, qw/utf8 latin-1/);
    } else {
        $enc = guess_encoding($data);
    }
    if (!ref($enc)) {
        # guess_encoding returns an error string (not an object) on failure
        return "ERROR: Can't guess: $enc for $data";
    } else {
        my $flag1a = utf8::is_utf8($data);
        my $flag2a = utf8::valid($data);
        $data .= "; encoding: ".$enc->name.", is_utf8=$flag1a, valid=$flag2a";
        return $data;
    }
}

Now for the questions! I am going to complement that code with various snippets.

say "Question 1";
&sayStrings($str1, $str2);

and

use open IO => ':encoding(UTF-8)';
say "raw";
&sayStrings($str1, $str2);

both give:

Question 1
1 = educação; encoding: utf8, is_utf8=, valid=1
2 = educa??o; encoding: iso-8859-1, is_utf8=, valid=1

Question 1A: Why does the use open IO => ':encoding(UTF-8)'; not do anything? I guess my system is set up as UTF8 already. Correct?

Question 1B: Why do the characters in 2 not display correctly? The encoding is detected correctly, but maybe when the string is output as UTF-8, the 'çã' become UTF-8 characters the system doesn't know about (or that don't exist)?

Now for question 2:

use open IO => ':encoding(UTF-8)',':std';
say "Question 2";
&sayStrings($str1, $str2);

gives:

Question 2
1 = educaÃ§Ã£o; encoding: utf8, is_utf8=, valid=1
2 = educação; encoding: iso-8859-1, is_utf8=, valid=1

Question 2: Why does this make the latin-1 string display correctly, but break the UTF8 string? (I.e. it seems that by adding :std, the character sequence in str1 is interpreted as latin-1, not UTF8, see perl Encode::Guess with and without hints - detecting utf8). Why is that?

Question 3:

use open IO => ':encoding(UTF-8)',':std';
say "fix_latin";
&sayStrings(&fix_latin($str1), &fix_latin($str2));

gives

fix_latin
1 = educação; encoding: utf8, is_utf8=1, valid=1
2 = educação; encoding: utf8, is_utf8=1, valid=1

Question 3: I guess fix_latin indicates that the string is utf8, and so the string prints correctly. So there's obviously something I'm not understanding about sign-posting the string as utf8 and binmode. What is it?

Many thanks!

(P.S. I have tried to read the docs on this, but yes, please do send links that explain this - ideally in clear language with plenty of examples...)

ikegami · Accepted Answer · 2018-11-05T11:55:08

First, you must realize that $str2 can be viewed both as a string encoded using iso-8859-1 and as a string of Unicode Code Points. That's because a string encoded using iso-8859-1 is no different from a string of Unicode Code Points: every iso-8859-1 byte value equals the Code Point it represents. For example, decode('iso-8859-1', $str) produces $str. This means that providing a string encoded using iso-8859-1 to something expecting a string of Unicode Code Points will work, and providing a string of Unicode Code Points to something expecting a string encoded using iso-8859-1 will work (if all the Code Points are in the iso-8859-1 character set).
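The equivalence described above can be checked directly. The following is a minimal sketch using only the core Encode module; the variable names are illustrative and not taken from the question.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Bytes of "educação" in iso-8859-1 (the tail of $str2 from the question).
my $latin1_bytes = "educa\x{e7}\x{e3}o";

# Decoding iso-8859-1 maps byte N to Code Point N, so the result compares equal.
my $chars = decode('iso-8859-1', $latin1_bytes);
my $same_after_decode = ($chars eq $latin1_bytes) ? 1 : 0;

# Encoding goes the other way and is likewise a no-op for Code Points below 256.
my $same_after_encode = (encode('iso-8859-1', $chars) eq $latin1_bytes) ? 1 : 0;

print "decode is a no-op: $same_after_decode\n";
print "encode is a no-op: $same_after_encode\n";
```

Both flags come out as 1: the internal storage of the two strings may differ, but Perl compares them character by character, and the characters are the same.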

Question 1A: Why does the use open IO => ':encoding(UTF-8)'; not do anything?

That sets the default layers for open. For example, it makes

open(my $fh, '>', $qfn)

equivalent to

open(my $fh, '>:encoding(UTF-8)', $qfn)

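Since you never call open without explicit layers (in fact, you don't call open at all), it has no effect. Without :std, the pragma doesn't touch the already-open STDIN, STDOUT and STDERR either.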
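To see what such a layer does to the data, here is a small sketch that writes through an explicit :encoding(UTF-8) layer. The in-memory file (\$buf) is an assumption made only so the resulting bytes can be inspected; a real file would receive the same bytes.

```perl
use strict;
use warnings;

# Write one character (U+00E7, "ç") through an explicit :encoding(UTF-8) layer.
my $buf = '';
open(my $fh, '>:encoding(UTF-8)', \$buf) or die "open failed: $!";
print {$fh} "\x{e7}";
close($fh) or die "close failed: $!";

# The layer turned the single Code Point into its two-byte UTF-8 encoding.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $buf;
print "$hex\n";   # c3 a7
```

With the pragma in effect, a plain open(my $fh, '>', ...) would get this layer by default.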

Question 1B: Why do the characters in 2 not display correctly?

Your terminal expects UTF-8.

The string encoded using UTF-8 ($str1) consists of what the terminal expects, so it is therefore displayed correctly.

The string encoded using iso-8859-1 ($str2) doesn't consist of what the terminal expects, so it is therefore displayed incorrectly.
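One way to confirm which of the two strings is well-formed UTF-8 is sketched below. looks_like_utf8 is a hypothetical helper name; it relies on the built-in utf8::decode, which returns false when its argument is not well-formed UTF-8.

```perl
use strict;
use warnings;

# Returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
# (Hypothetical helper; utf8::decode works in place, so it gets a copy.)
sub looks_like_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

my $str1_ok = looks_like_utf8("1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o");
my $str2_ok = looks_like_utf8("2 = educa\x{e7}\x{e3}o");

print "str1 is UTF-8: $str1_ok\n";
print "str2 is UTF-8: $str2_ok\n";
```

$str1 passes and $str2 does not, because the byte e7 announces a multi-byte sequence and e3 is not a valid continuation byte. A UTF-8 terminal hits the same problem and shows a replacement character.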

Question 2: Why does this make the latin-1 string display correctly, but break the UTF8 string?

You added a :encoding(UTF-8) layer to STDOUT, so strings printed to STDOUT are now expected to consist of Unicode Code Points, and they will be encoded using UTF-8.

The string encoded using UTF-8 ($str1) doesn't consist of what print expects, so it is therefore mangled. (It ends up "double-encoded", to be specific.)

The string of Unicode Code Points ($str2) consists of what print expects, so it is therefore encoded correctly.
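The double encoding can be reproduced by hand. This sketch applies encode('UTF-8', ...) to both strings, which is assumed here to stand in for what the output layer does; hex_bytes is a hypothetical helper for displaying bytes.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $str1 = "educa\x{c3}\x{a7}\x{c3}\x{a3}o";   # already UTF-8 bytes
my $str2 = "educa\x{e7}\x{e3}o";               # Code Points (or iso-8859-1 bytes)

# What the :encoding(UTF-8) layer does on output, done explicitly.
my $out1 = encode('UTF-8', $str1);
my $out2 = encode('UTF-8', $str2);

# Hypothetical helper: show each byte of a string as two hex digits.
sub hex_bytes { join ' ', map { sprintf '%02x', ord } split //, $_[0] }

print hex_bytes($out1), "\n";   # each of c3, a7, c3, a3 became two bytes
print hex_bytes($out2), "\n";   # the same bytes $str1 started with
```

The bytes c3 and a7 of $str1 are treated as the characters Ã and §, each of which is then encoded separately, giving the "educaÃ§Ã£o" output. Encoding $str2 yields exactly the bytes that $str1 held in the first place.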

Question 3: I guess fix_latin indicates that the string is utf8, and so the string prints correctly.

The internal representation (as indicated by is_utf8) is irrelevant here (as it should be).

fix_latin("1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o") produced "1 = educa\x{e7}\x{e3}o".

fix_latin("2 = educa\x{e7}\x{e3}o") produced "2 = educa\x{e7}\x{e3}o".
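For these two inputs, the same results can be obtained with the core Encode module alone. The sketch below is not how Encoding::FixLatin is implemented: to_chars is a hypothetical, simplified stand-in that treats the whole string as UTF-8 if it is well-formed and as iso-8859-1 otherwise, so it cannot handle a single string that mixes both encodings.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical stand-in for fix_latin: decode as UTF-8 when possible,
# otherwise fall back to iso-8859-1. Returns a string of Code Points.
sub to_chars {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps $bytes intact.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $chars ? $chars : decode('iso-8859-1', $bytes);
}

my $fixed1 = to_chars("1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o");
my $fixed2 = to_chars("2 = educa\x{e7}\x{e3}o");

# Both are now strings of Code Points, which is what an encoding layer expects.
binmode(STDOUT, ':encoding(UTF-8)');
print "$fixed1\n$fixed2\n";
```

Once both strings hold Code Points, the :encoding(UTF-8) layer on STDOUT encodes each of them exactly once, so both display correctly.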
