perl Encode::Guess with and without hints - detecting utf8

Question

I am confused about Encode::Guess. Suppose this is my Perl code:

use strict; 
use warnings;
use 5.18.2;
use Encode;
use Encode::Guess qw/utf8 iso-8859-1/;
use open IO => ':encoding(UTF-8)', ':std';
my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";
my $str2 =  "2 = educa\x{e7}\x{e3}o";

say "A: " . fixEnc($str1);
say "B: " . fixEnc($str1, 'hint');
say "C: " . fixEnc($str2);
say "D: " . fixEnc($str2, 'hint');
say "";

# Guess the encoding of $data and decode it. A true second argument passes
# utf8 and iso-8859-1 to guess_encoding as per-call hints.
sub fixEnc {
    my ($data, $hint) = @_;
    my $enc = $hint
        ? guess_encoding($data, qw/utf8 iso-8859-1/)
        : guess_encoding($data);
    # guess_encoding returns an encoding object on success, an error string otherwise.
    if (!ref($enc)) {
        return "ERROR: Can't guess: $enc for $data";
    }
    my $utf8 = decode($enc->name, $data);
    return "encoding guess: " . $enc->name . "; result: $utf8";
}

It produces:

A1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o
B1: ERROR: Can't guess: utf8 or iso-8859-1 for 1 = educaÃ§Ã£o
C1: encoding guess: iso-8859-1; result: 2 = educação
D1: encoding guess: iso-8859-1; result: 2 = educação

Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;', I get:

A2: encoding guess: utf8; result: 1 = educação
B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o
C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação
D2: encoding guess: iso-8859-1; result: 2 = educação

What causes the difference? In particular, why is utf8 not guessed when I hint with utf8?

Edit: I have posted an answer below. Basically, the realisation is that Guess goes by character encodings and doesn't speak Portuguese! 'educaÃ§Ã£o', while not Portuguese, is a perfectly valid latin-1 reading of the bytes in string 1 above, and Guess (unlike a Portuguese speaker) cannot distinguish it from the UTF-8 reading 'educação'.

Amended code appropriately. Though for me that doesn't make any difference? — BBB

BBB BBB · Accepted Answer · 2018-11-05T07:49:44

I think this is what's going on. With use Encode::Guess qw/utf8 iso-8859-1/; both encodings are already suspects for every call, so passing them again as a 'hint' makes no difference (sorry for being unclear!), which leaves only two distinct results:

A1/B1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o

and C1/D1: encoding guess: iso-8859-1; result: 2 = educação

For A1/B1, the bytes could be UTF-8 (educação) or they could be latin-1 (educaÃ§Ã£o). The second reading looks wrong to a human, but Encode::Guess cannot tell: it only checks whether the bytes are valid in each candidate encoding, and it doesn't speak Portuguese!
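
To make the ambiguity concrete, here is a minimal sketch that decodes the bytes of string 1 both ways using only Encode's decode. The variable names are mine; the point is simply that neither decode fails, so the bytes alone cannot rule out either encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The raw bytes of string 1 from the question.
my $bytes = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";

# FB_CROAK makes decode die on invalid input; LEAVE_SRC stops it from
# consuming $bytes in place. Neither call dies for these bytes.
my $check     = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $as_utf8   = decode('UTF-8',      $bytes, $check);
my $as_latin1 = decode('iso-8859-1', $bytes, $check);

# Each two-byte UTF-8 sequence becomes one character, while latin-1
# maps every byte to its own character.
printf "as UTF-8:   %d characters\n", length $as_utf8;    # 12
printf "as latin-1: %d characters\n", length $as_latin1;  # 14
```

Both results are legitimate character strings; only knowledge of the language tells you that the 12-character one is the intended text.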

Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;', I get:

A2: encoding guess: utf8; result: 1 = educação

latin-1 is no longer an option (it's not among the default suspects, which are ascii, utf8 and BOM-marked UTF-16/32), so utf8 is the only candidate that fits and the result comes out as utf8.

B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o

In B2, with the hint, we're back in the above scenario, and Guess cannot decide.
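
If a tie like this has to be broken, one common workaround (not something Encode::Guess does for you) is to try a strict UTF-8 decode first and fall back to latin-1 only when that fails. The sketch below assumes that UTF-8 should win whenever the bytes are valid UTF-8; decode_utf8_or_latin1 is a hypothetical helper name, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): prefer UTF-8, fall back to latin-1.
# Returns the name of the encoding that was used and the decoded text.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed UTF-8; LEAVE_SRC keeps $bytes intact.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ('UTF-8', $text) if defined $text;
    # Every byte sequence is valid latin-1, so this fallback always succeeds.
    return ('iso-8859-1', decode('iso-8859-1', $bytes));
}

my ($enc1, $text1) = decode_utf8_or_latin1("1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o");
my ($enc2, $text2) = decode_utf8_or_latin1("2 = educa\x{e7}\x{e3}o");

print "string 1 decoded as $enc1\n";
print "string 2 decoded as $enc2\n";
```

The trade-off is that genuine latin-1 text which happens to form valid UTF-8 sequences would be misread, which is exactly the case Guess refuses to call.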

For C2:

C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação

This makes sense: the bytes \xe7\xe3 are neither ASCII nor valid UTF-8, and latin-1 isn't part of the default, so no suspect is left. Finally, in D2

D2: encoding guess: iso-8859-1; result: 2 = educação

latin-1 is hinted, so the encoding is detected.
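
The four "2" cases can be reproduced in one short script. This is a sketch under the assumption that Encode::Guess is loaded without an import list, so only the default suspects apply unless a hint is passed per call; describe is a helper of mine, and the comments restate the results reported above rather than exact message text.

```perl
use strict;
use warnings;
use Encode::Guess;    # no import list: default suspects only

my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";    # UTF-8 bytes
my $str2 = "2 = educa\x{e7}\x{e3}o";                # latin-1 bytes

# guess_encoding returns an encoding object on success and a plain
# error string otherwise, so ref() separates the two outcomes.
sub describe {
    my ($result) = @_;
    return ref $result ? 'guessed ' . $result->name : "no guess ($result)";
}

# A2: only utf8 fits, so it is guessed.
print "A2: ", describe(guess_encoding($str1)), "\n";
# B2: utf8 and iso-8859-1 both fit, so Guess reports the ambiguity.
print "B2: ", describe(guess_encoding($str1, 'iso-8859-1')), "\n";
# C2: no default suspect fits these bytes.
print "C2: ", describe(guess_encoding($str2)), "\n";
# D2: only the hinted iso-8859-1 fits, so it is guessed.
print "D2: ", describe(guess_encoding($str2, 'iso-8859-1')), "\n";
```

Hints passed to guess_encoding apply to that call only, whereas an import list on the use line changes the suspects for every later call, which is why the hint made no difference in the first run.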
