I am confused about Encode::Guess. Suppose this is my Perl code:
use strict;
use warnings;
use 5.18.2;
use Encode;
use Encode::Guess qw/utf8 iso-8859-1/;
use open IO => ':encoding(UTF-8)', ':std';
my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";
my $str2 = "2 = educa\x{e7}\x{e3}o";
say "A: ".&fixEnc($str1);
say "B: ".&fixEnc($str1,'hint');
say "C: ".&fixEnc($str2);
say "D: ".&fixEnc($str2,'hint');
say "";
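sub fixEnc() {
    my $data = $_[0];
    my $enc = "";
    if ($_[1]) {
        # second argument given: pass explicit suspects to the guesser
        $enc = guess_encoding($data,qw/utf8 iso-8859-1/);
    } else {
        # no hint: rely on the default suspect list
        $enc = guess_encoding($data);
    }
    if (!ref($enc)) {
        # on failure guess_encoding returns an error string, not an object
        return "ERROR: Can't guess: $enc for $data";
    } else {
        my $utf8 = decode($enc->name, $data);
        $utf8 = "encoding guess: ".$enc->name."; result: $utf8";
        return $utf8;
    }
}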
It produces:
A1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o
B1: ERROR: Can't guess: utf8 or iso-8859-1 for 1 = educaÃ§Ã£o
C1: encoding guess: iso-8859-1; result: 2 = educação
D1: encoding guess: iso-8859-1; result: 2 = educação
Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;', I get:
A2: encoding guess: utf8; result: 1 = educação
B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o
C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação
D2: encoding guess: iso-8859-1; result: 2 = educação
What causes the difference? In particular, why is utf8 not guessed when I hint with utf8?
Edit: I have posted an answer below. Basically, the realisation is that Guess works purely on byte sequences and doesn't speak Portuguese! 'educaÃ§Ã£o', while not Portuguese, is a perfectly valid Latin-1 reading of the bytes in string 1 above, so Guess (unlike a Portuguese speaker) cannot distinguish it from the UTF-8 reading 'educação'.
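To make the ambiguity concrete, here is a minimal sketch showing that the bytes of string 1 decode without error under both UTF-8 and Latin-1, which is why a guesser that only checks validity cannot pick one. The helpers try_decode and decode_prefer_utf8 are hypothetical names introduced for illustration; they are not part of Encode or Encode::Guess. The "strict UTF-8 first, Latin-1 otherwise" fallback is one possible workaround under the assumption that the input is always one of those two encodings, not a general solution.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: decode $bytes strictly as $name. Returns the decoded
# text, or undef if the bytes are not valid in that encoding.
sub try_decode {
    my ($name, $bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { decode($name, $copy, FB_CROAK) };
    return $text;
}

# Hypothetical helper: prefer strict UTF-8, otherwise treat the bytes as
# Latin-1. Assumes the input is always one of these two encodings.
sub decode_prefer_utf8 {
    my ($bytes) = @_;
    my $text = try_decode('UTF-8', $bytes);
    return defined $text ? $text : decode('iso-8859-1', $bytes);
}

my $str1 = "educa\xc3\xa7\xc3\xa3o";    # the UTF-8 bytes from the question
my $str2 = "educa\xe7\xe3o";            # the Latin-1 bytes from the question

# The same bytes are valid under both encodings, with different lengths.
for my $name (qw(UTF-8 iso-8859-1)) {
    my $text = try_decode($name, $str1);
    printf "%-10s %s\n", $name,
        defined $text ? 'valid, ' . length($text) . ' characters' : 'invalid';
}

# With the UTF-8 preference, both inputs decode to the same character string.
print decode_prefer_utf8($str1) eq decode_prefer_utf8($str2)
    ? "same text\n"
    : "different text\n";
```

Every byte sequence is valid Latin-1, so the fallback branch never fails; the trade-off is that genuine Latin-1 text which happens to form valid UTF-8 (such as 'educaÃ§Ã£o') will be read as UTF-8.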
use strict; use warnings;! – Biffen