1
votes

I am confused about Encode::Guess. Suppose this is my perl code:

use strict; 
use warnings;
use 5.18.2;
use Encode;
use Encode::Guess qw/utf8 iso-8859-1/;
use open IO => ':encoding(UTF-8)', ':std';
my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";
my $str2 =  "2 = educa\x{e7}\x{e3}o";

say "A: ".&fixEnc($str1);
say "B: ".&fixEnc($str1,'hint');
say "C: ".&fixEnc($str2);
say "D: ".&fixEnc($str2,'hint');
say "";

sub fixEnc {
    my ($data, $hint) = @_;
    # With a hint, pass the suspects explicitly; otherwise rely on the
    # suspects configured by the 'use Encode::Guess ...' line.
    my $enc = $hint
        ? guess_encoding($data, qw/utf8 iso-8859-1/)
        : guess_encoding($data);
    # On failure guess_encoding returns an error string, not an object.
    return "ERROR: Can't guess: $enc for $data" unless ref $enc;
    my $utf8 = decode($enc->name, $data);
    return "encoding guess: " . $enc->name . "; result: $utf8";
}

It produces:

A1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o
B1: ERROR: Can't guess: utf8 or iso-8859-1 for 1 = educaÃ§Ã£o
C1: encoding guess: iso-8859-1; result: 2 = educação
D1: encoding guess: iso-8859-1; result: 2 = educação

Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;' I get

A2: encoding guess: utf8; result: 1 = educação
B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o
C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação
D2: encoding guess: iso-8859-1; result: 2 = educação

What causes the difference? In particular, why is utf8 not guessed when I hint with utf8?

Edit: I have posted an answer below. Basically, the realisation is that Guess goes by character encodings and doesn't speak Portuguese! 'educaÃ§Ã£o', while not Portuguese, is a perfectly valid latin-1 reading of string 1 above, so Guess cannot distinguish it from the UTF-8 reading 'educação' (unlike a Portuguese speaker).

2
Always use strict; use warnings;! - Biffen
Amended code appropriately. Though for me that doesn't make any difference? - BBB

2 Answers

1
votes

I think this is what's going on. With use Encode::Guess qw/utf8 iso-8859-1/; both encodings are already suspects, so the 'hint' makes no difference (sorry for being unclear!), and we only have

A1/B1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o

and C1/D1: encoding guess: iso-8859-1; result: 2 = educação

For A1/B1, the string could be UTF-8 (educação) or it could be latin-1 (educaÃ§Ã£o). The second one looks incorrect, but Encode::Guess cannot tell: Guess goes by character encodings and doesn't speak Portuguese!

Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' by 'use Encode::Guess;' I get

A2: encoding guess: utf8; result: 1 = educação

latin-1 is no longer an option (it's not part of the defaults, which per the Encode::Guess docs are only ascii, utf8 and UTF-16/32 with BOM), so the result comes out as utf8.

B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o

In B2, with the hint, we're back in the above scenario, and Guess cannot decide.

For C2:

C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação

this makes sense, as latin-1 isn't part of the default. Finally in D2

D2: encoding guess: iso-8859-1; result: 2 = educação

latin-1 is hinted, so the encoding is detected.
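
To check the second run in isolation, here is a minimal sketch that uses only the default suspects and passes the hint per call. The helper name describe is made up for this example; the results in the comments are the ones reported in the question, not something guaranteed for every Encode version.

```perl
use strict;
use warnings;
use Encode::Guess;    # no import list: default suspects only

my $utf8_bytes   = "educa\xc3\xa7\xc3\xa3o";    # string 1 from the question
my $latin1_bytes = "educa\xe7\xe3o";            # string 2 from the question

# Hypothetical helper: returns the guessed encoding name, or the
# error string that guess_encoding hands back when it cannot decide.
sub describe {
    my ($data, @suspects) = @_;
    my $enc = guess_encoding($data, @suspects);
    return ref $enc ? $enc->name : "no guess ($enc)";
}

my @hint = qw/utf8 iso-8859-1/;
print "A2: ", describe($utf8_bytes),          "\n";    # question reports utf8
print "B2: ", describe($utf8_bytes, @hint),   "\n";    # question reports an ambiguity
print "C2: ", describe($latin1_bytes),        "\n";    # question reports no candidates
print "D2: ", describe($latin1_bytes, @hint), "\n";    # question reports iso-8859-1
```

Suspects passed to guess_encoding apply to that call only, so the order of the four calls should not matter.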

0
votes

It's hard to say for sure because there are a few issues at work that make detecting the encoding difficult.

First is the fact that every byte sequence is valid iso-8859-1, so any valid UTF-8 input also decodes cleanly as iso-8859-1. Unless there's a definitive byte-order mark at the start of the string or a byte sequence that is invalid in one of the candidates (as in string 2, which is not valid UTF-8), Encode::Guess really is just guessing.

Second is mentioned in the Encode::Guess caveats in the perldoc. Encode::Guess runs through the text using a 'trial-and-error' algorithm to eliminate all but one of the provided encodings. Naturally, the more alike two encodings are, the less accurate the module will be.
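
As a rough illustration of that elimination idea (a simplified sketch, not the module's actual code), one can try a strict decode with each candidate and keep the ones that survive. The helper name surviving_candidates is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the candidate names under which
# $bytes decodes without error.
sub surviving_candidates {
    my ($bytes, @candidates) = @_;
    my @ok;
    for my $name (@candidates) {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC stops it from modifying $bytes in place.
        my $decoded_ok = eval {
            decode($name, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
            1;
        };
        push @ok, $name if $decoded_ok;
    }
    return @ok;
}

my @candidates = qw(UTF-8 iso-8859-1);
my @for_str1 = surviving_candidates("educa\xc3\xa7\xc3\xa3o", @candidates);
my @for_str2 = surviving_candidates("educa\xe7\xe3o",         @candidates);

print "string 1 survivors: @for_str1\n";    # both candidates remain
print "string 2 survivors: @for_str2\n";    # only iso-8859-1 remains
```

With two survivors there is nothing left to eliminate, which is the situation behind the "iso-8859-1 or utf8" error in the question.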

Third, when you don't specify the allowed encodings in the use statement, the module falls back to its default suspects (ascii, utf8 and UTF-16/32 with BOM, according to the docs), and any encodings passed to guess_encoding are added on top for that call. This, combined with the trial-and-error approach and the overlap between utf8 and iso-8859-1, means it's possible for Encode::Guess to reach different conclusions based on the parameters passed to the method. I imagine you would get more consistent results if you checked against two more divergent encodings, like utf8 vs 7bit-jis.
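
If the input can only ever be UTF-8 or iso-8859-1, a common workaround is to skip guessing altogether: try a strict UTF-8 decode first and fall back to iso-8859-1 when that fails. This is a sketch of that alternative, not something Encode::Guess does for you, and the helper name decode_utf8_or_latin1 is made up here. It assumes that text which happens to be valid UTF-8 really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: prefer strict UTF-8, fall back to iso-8859-1.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    # Strict decode: dies on malformed UTF-8, leaves $bytes untouched.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;
    # Every byte sequence is valid iso-8859-1, so this cannot fail.
    return decode('iso-8859-1', $bytes);
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_utf8_or_latin1("1 = educa\xc3\xa7\xc3\xa3o"), "\n";
print decode_utf8_or_latin1("2 = educa\xe7\xe3o"),         "\n";
```

The trade-off is that a genuine iso-8859-1 string whose bytes happen to form valid UTF-8 (such as 'educaÃ§Ã£o') will be read as UTF-8.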

Lastly, Perl's Encode has more than one implementation of UTF-8 (the lax 'utf8' and the strict 'UTF-8', also known as utf-8-strict), so it's also possible that when you don't specify the 'utf8' encoding explicitly, it might be using a different implementation that may change the results as well. I don't know enough about Perl's internals to confirm that's what's happening in this case though.
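
For what it's worth, the two implementations can be told apart by asking Encode what each label resolves to. This is only a small sketch to show the naming; it does not settle whether the difference matters for the question, where 'utf8' is used in both the use line and the hint.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map a few common spellings to the canonical encoding name Encode uses.
my %canonical;
for my $label (qw(utf8 UTF-8 utf-8)) {
    $canonical{$label} = find_encoding($label)->name;
    printf "%-6s resolves to %s\n", $label, $canonical{$label};
}
```

Per the Encode documentation, 'utf8' names the lax variant, while 'UTF-8' (in any case) resolves to utf-8-strict.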