Quoting the Perl Unicode FAQ, "What if I don't decode?":

Whenever your encoded, binary string is used together with a text string, Perl will assume that your binary string was encoded with ISO-8859-1, also known as latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For example, if it was UTF-8, the individual bytes of multibyte characters are seen as separate characters, and then again converted to UTF-8. Such double encoding can be compared to double HTML encoding (&amp;gt;), or double URI encoding (%253E).

This silent implicit decoding is known as "upgrading". That may sound positive, but it's best to avoid it.
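The double encoding that the FAQ describes can be reproduced with a short sketch. It assumes only the core Encode module; the variable names and the hex_bytes helper are made up for illustration and do not come from the FAQ.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render each element of a string as two hex digits.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $s;
}

# One character: U+00E9 (e with acute accent).
my $text = "\x{E9}";

# Encoding once yields the two UTF-8 bytes C3 A9.
my $once = encode('UTF-8', $text);

# Encoding the already-encoded bytes again treats C3 and A9 as the
# characters U+00C3 and U+00A9, so the result is four bytes.
my $twice = encode('UTF-8', $once);

print hex_bytes($once),  "\n";   # C3 A9
print hex_bytes($twice), "\n";   # C3 83 C2 A9
```

The second call is the mistake: its input was already bytes, yet it was handed to a function that expects text.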

Disabling this implicit decoding would force the programmer to use decode()/encode() properly and help prevent bugs.

Is it possible to turn off implicit decoding? Ideally, using a binary string together with a text string would result in an error.

Detecting that kind of problem is possible, but Perl would need to start assigning meaning to strings: text vs. bytes vs. unknown. Then, if a function that expects text gets bytes, it could warn or die. (comment by ikegami)

1 Answer

I hate that passage. Perl never implicitly decodes strings using iso-8859-1. For starters, Perl has no way of knowing if the string has been decoded or not.

Consider the following:

my $num_apples = 4;
my $num_vegetables = $num_apples;

Did Perl implicitly convert fruits into vegetables? No! Well, then why would you say it implicitly decoded using iso-8859-1 in the following?

my $bytes = "\x61\x62\x63\xE9";
$bytes =~ /♠/;

In the first snippet, you treated what were supposedly apples as vegetables. In the second snippet, you treated what were supposedly bytes as Unicode code points.

If you have a function that expects a string of Unicode characters, and you pass

"\x61\x62\x63\xE9"

to it, it will be treated as "abcé" because Unicode code point 0x61 is "a", Unicode code point 0x62 is "b", etc. No decoding happens. Perhaps you got that string from using

decode('UTF-8', "\x61\x62\x63\xC3\xA9");

or

decode('iso-8859-1', "\x61\x62\x63\xE9");

but maybe you didn't use decode at all and simply started with

"\x61\x62\x63\xE9"

or

read($bin_fh, $buf, 4)

That doesn't mean that Perl implicitly decoded anything. Since no implicit decoding occurs, it's impossible to turn it off. The answer is no.
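As a rough illustration of the point above, here is a minimal sketch, assuming the core Encode module, that compares the literal string with the results of the two decode calls. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same four values, obtained three different ways.
my $literal     = "\x61\x62\x63\xE9";
my $from_utf8   = decode('UTF-8',      "\x61\x62\x63\xC3\xA9");
my $from_latin1 = decode('iso-8859-1', "\x61\x62\x63\xE9");

# Each string is a sequence of the numbers 0x61, 0x62, 0x63, 0xE9.
for my $s ($literal, $from_utf8, $from_latin1) {
    print join(' ', map { sprintf '%02X', ord } split //, $s), "\n";
}

# Perl compares strings element by element, so all three are equal.
print "all equal\n"
    if $literal eq $from_utf8 && $literal eq $from_latin1;
```

Nothing in the strings themselves records which of the three routes produced them, which is why Perl cannot tell whether a given string "has been decoded".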