I have a Perl storage file which (when dumped with Dumper) contains these strings:
my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";
my $str2 = "2 = educa\x{e7}\x{e3}o";
I've been trying to work out a reasonable strategy for outputting UTF-8 (see also perl Encode::Guess with and without hints - detecting utf8).
Let me continue with the Perl code above and get some declarations out of the way:
use 5.18.2;
use Encode qw( encode_utf8 decode_utf8 from_to encode decode);
use Encode::Guess;
use Encoding::FixLatin qw(fix_latin);
sub sayStrings {
say fixEnc($_[0]);
say fixEnc($_[1],'hint');
say "";
};
sub fixEnc {
my $data = $_[0];
my $enc = "";
if ($_[1]) {
$enc = guess_encoding($data, qw/utf8 latin-1/);
} else {
$enc = guess_encoding($data);
};
if (!ref($enc)) {
return "ERROR: Can't guess: $enc for $data";
} else {
my $flag1a = utf8::is_utf8($data);
my $flag2a = utf8::valid($data);
$data .= "; encoding: ".$enc->name.", is_utf8=$flag1a, valid=$flag2a";
return $data;
};
};
Now for the questions! I am going to complement that code with various snippets.
say "Question 1";
&sayStrings($str1, $str2);
and
use open IO => ':encoding(UTF-8)';
say "raw";
&sayStrings($str1, $str2);
both give:
Question 1
1 = educação; encoding: utf8, is_utf8=, valid=1
2 = educa??o; encoding: iso-8859-1, is_utf8=, valid=1
Question 1A: Why does use open IO => ':encoding(UTF-8)'; not do anything? I guess my system is set up as UTF8 already. Correct?
Question 1B: Why do the characters in 2 not display correctly? The encoding is detected correctly, but maybe when the string is output as UTF-8, the 'çã' become UTF-8 characters the system doesn't know about (or that don't exist)?
Now for question 2:
use open IO => ':encoding(UTF-8)',':std';
say "Question 2";
&sayStrings($str1, $str2);
gives:
Question 2
1 = educaÃ§Ã£o; encoding: utf8, is_utf8=, valid=1
2 = educação; encoding: iso-8859-1, is_utf8=, valid=1
Question 2: Why does this make the latin-1 string display correctly, but break the UTF8 string? (I.e. it seems that by adding :std, the character sequence in str1 is interpreted as latin-1, not UTF8, see perl Encode::Guess with and without hints - detecting utf8). Why is that?
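To make Question 2 concrete, here is a minimal sketch of what an :encoding(UTF-8) output layer would presumably write for each string. It assumes the layer behaves like Encode's encode('UTF-8', ...), and show is a hypothetical helper (not part of the script above) that prints non-ASCII bytes as \x{..} escapes so the result does not depend on the terminal:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The same two strings as in the question.
my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";   # UTF-8 bytes, one per character
my $str2 = "2 = educa\x{e7}\x{e3}o";               # Latin-1 code points

# Hypothetical helper: show every character above 0x7F as a \x{..} escape.
sub show {
    my ($s) = @_;
    return join '', map { ord($_) > 0x7f ? sprintf('\\x{%02x}', ord $_) : $_ } split //, $s;
}

# Bytes produced when each string is encoded as-is.
print "str1 as-is:   ", show(encode('UTF-8', $str1)), "\n";
print "str2 as-is:   ", show(encode('UTF-8', $str2)), "\n";

# Bytes produced when str1 is first decoded from UTF-8 into characters.
print "str1 decoded: ", show(encode('UTF-8', decode('UTF-8', $str1))), "\n";
```

If that assumption holds, the first line shows each of the four high bytes in str1 being encoded a second time, while the second and third lines produce identical bytes after the "N = " prefix.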
Question 3:
use open IO => ':encoding(UTF-8)',':std';
say "fix_latin";
&sayStrings(&fix_latin($str1), &fix_latin($str2));
gives
fix_latin
1 = educação; encoding: utf8, is_utf8=1, valid=1
2 = educação; encoding: utf8, is_utf8=1, valid=1
Question 3: I guess fix_latin indicates that the string is utf8, and so the string prints correctly. So there's obviously something I'm not understanding about sign-posting the string as utf8 and binmode. What is it?
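A possible way to test the guess in Question 3 without Encoding::FixLatin is sketched below. It assumes the step that matters is decoding each string from its own encoding into characters, which may or may not be what fix_latin does internally; the variable names $chars1 and $chars2 are made up for the illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same two strings as in the question.
my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";   # UTF-8 bytes, one per character
my $str2 = "2 = educa\x{e7}\x{e3}o";               # Latin-1 code points

# Decode each string from the encoding it is assumed to be in.
my $chars1 = decode('UTF-8',      $str1);
my $chars2 = decode('ISO-8859-1', $str2);

# Compare the text after the "N = " prefix, and the lengths before and after.
printf "same text: %s\n", substr($chars1, 4) eq substr($chars2, 4) ? "yes" : "no";
printf "str1 length: %d -> %d\n", length($str1), length($chars1);
printf "str2 length: %d -> %d\n", length($str2), length($chars2);

# Report the internal flag that the is_utf8= field above is printing.
printf "is_utf8: %s / %s\n", map { utf8::is_utf8($_) ? 1 : 0 } $chars1, $chars2;
```

The length comparison is the interesting part: str1 shrinks because each two-byte sequence becomes one character, whereas str2 keeps its length.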
Many thanks!
(P.S. I have tried to read the docs on this, but yes, please do send links that explain this - ideally in clear language with plenty of examples...)