0
votes

The Euro character is 0xe282ac in UTF-8

I'm trying to output a string containing a UTF-8 character to STDOUT in Perl.

So I set my script to be in UTF-8 with 'use utf8;'

And I set up my STDOUT to be in UTF-8 with 'binmode'.

An example script is:

use utf8;
binmode STDOUT, ':utf8';
print "I owe you 160\x{20ac}\n";
print "I owe you 80\xe2\x82\xac\n";  # UTF-8 encoding?

The \x{codepoint} form works fine, but the hand-encoded UTF-8 bytes give me garbled output:

I owe you 160€
I owe you 80â¬
The codepoint is not 0xe282ac in UTF-8. That is purely how it's encoded as a bytestring. – Phylogenesis
Right - and can't I put that encoding in the string? If not, how do I specify UTF-8 characters in strings? – David Ljung Madison Stellar
See Encode for how to decode a UTF-8 byte string into a Perl string. – Håkon Hægland
use utf8 turns on UTF-8 in the source, i.e. you can write print "I owe you 160€\n" (you need to save the script as UTF-8). – choroba
@Phylogenesis, Re "I believe the braced version specifies the codepoint, not an exact bytestring.", there is no difference between the braced version and the non-braced version. "\xe2" is 100% equivalent to "\x{e2}". – ikegami

3 Answers

5
votes

If you want a string that consists of the three bytes E2 82 AC, you can declare it like this:

my $bytes = "\xE2\x82\xAC";

The \xXX form in a double quoted string uses up to two hex digits (without braces, Perl never reads more than two) to represent one byte.

The string above contains 3 bytes. If we pass the string to the length function it will return 3:

say 'Length of $bytes is: ' . length($bytes);    # 3
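If you want to confirm exactly what ended up in the string, one way is sprintf's %v flag, which formats each character's ordinal in turn (shown here as a small sketch alongside length):

```perl
use strict;
use warnings;
use feature 'say';

my $bytes = "\xE2\x82\xAC";

# %vX prints each character's ordinal in uppercase hex, joined by dots
say sprintf '%vX', $bytes;    # E2.82.AC
```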

Perl has no way of knowing whether those three bytes are intended to represent the Euro symbol. They could equally be a three byte sequence from inside a JPEG file, or a ZIP file, or an SSL-encoded TCP data stream traversing a network. Perl doesn't know or care - it's just three bytes.

If you actually want a string of characters (rather than bytes) then you need to provide the character data in a way that allows Perl to use its internal representation of Unicode characters to store them in memory. One way is to provide the non-ASCII characters in UTF-8 form in the source code. If you're doing this you'll need to say use utf8 at the top of your script to tell the Perl interpreter to treat non-ASCII string literals as UTF-8:

use utf8;

my $euro_1 = "€";

Alternatively you can use the form \x{X...} with one to six hex digits representing the Unicode code point number. This will declare an identical string:

my $euro_2 = "\x{20ac}";

Each of these strings contains a multi-byte representation of the euro character in Perl's internal encoding. Perl knows the strings are character strings so the length function will return 1 (for 1 character) in each case:

say 'Length of $euro_1 is: ' . length($euro_1);    # 1
say 'Length of $euro_2 is: ' . length($euro_2);    # 1

The defining feature of Perl's internal representation of character strings is that it is for use inside Perl. If you want to write the data out to a file or a socket, you'll need to encode the character string to a sequence of bytes:

use Encode qw(encode);

say encode('UTF-8', $euro_1);

It's also possible to use binmode or an argument to open to say that any string written to a particular filehandle should be encoded to a specific encoding.

binmode(STDOUT, ':encoding(utf-8)');

say $euro_1;

This will only work correctly for character strings. If we took our original 3-byte string $bytes and used either encode or IO layers, we would end up with garbage, because Perl would take each byte and convert it to UTF8. So \xE2 would be output as \xC3\xA2, \x82 would be output as \xC2\x82 and so on.
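As a sketch of that failure mode, passing the already-encoded byte string through encode a second time produces exactly those doubled bytes:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bytes = "\xE2\x82\xAC";    # already the UTF-8 encoding of the euro sign

# encode() treats each byte as a code point and encodes it again
my $garbled = encode('UTF-8', $bytes);
printf "%vX\n", $garbled;      # C3.A2.C2.82.C2.AC
```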

However, we can use the decode function from the Encode module to convert the 3-byte $bytes string into a single-character string in Perl's internal character representation:

use Encode qw(decode);

my $bytes = "\xE2\x82\xAC";
my $euro_3 = decode('UTF-8', $bytes);

say 'Length of $euro_3 is ' . length($euro_3);    # 1

One minor nitpick: in your original question you stated that 20AC is the UTF-16 representation of the euro symbol. In fact there are two different UTF-16 representations: UTF-16BE and UTF-16LE, with the latter using the opposite byte order: AC20.
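You can see the two byte orders with Encode (the BOM-less variants shown here; the plain 'UTF-16' encoding would prepend a byte-order mark):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro = "\x{20ac}";

# the same code point, serialized with each byte order
printf "UTF-16BE: %vX\n", encode('UTF-16BE', $euro);   # 20.AC
printf "UTF-16LE: %vX\n", encode('UTF-16LE', $euro);   # AC.20
```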

3
votes

As the fileformat.info page that you link to describes, the Unicode EURO SIGN character is at code point 20AC and may be referred to as U+20AC. In UTF-8 that is encoded as the three bytes 0xE2 0x82 0xAC.

To add the Unicode character to a string, you may write

"I owe you \x{20ac}160\n"

or

"I owe you \N{EURO SIGN}160\n"

or

"I owe you \N{U+20AC}160\n"

or, if you use utf8 at the top of your program, you may add the literal character with the same effect

"I owe you €160\n"

Each of these will add a single character with the required code point to the string.

If you use

"I owe you 80\xe2\x82\xac\n"

then you have created a string with three characters which correspond to the UTF-8 encoding of the EURO SIGN character, which is a very different thing. You may use decode_utf8 from the Encode module to convert those bytes to a single character, but otherwise you have a UTF-8-encoded byte string, which is a different thing from a character string.

Here's an example program

use strict;
use warnings 'all';

use open qw/ :std :encoding(UTF-8) /;

use Encode qw/ decode_utf8 :fallbacks /;

for my $s (
        "I owe you \x{20ac}160\n",
        "I owe you \N{EURO SIGN}160\n",
        "I owe you \N{U+20AC}160\n",
        do { use utf8; "I owe you €160\n" },
        decode_utf8(my $ss = "I owe you \xe2\x82\xac160\n") ) {

    print $s;
}

output

I owe you €160
I owe you €160
I owe you €160
I owe you €160
I owe you €160

Note that there is no need for use utf8 unless you are using non-ASCII characters in the source code, such as €. You may access characters by their Unicode names (which are always ASCII) as shown above.

If I redirect to a file, I can see that it's encoding the first Euro symbol as expected, 0xe282ac, but the second is becoming 0xc3a2c282c2ac, so somehow it's getting garbled, as if it's being encoded twice.

It is being encoded twice. You encode the character yourself the first time by supplying the UTF-8 encoding "\xe2\x82\xac" for the character, and the binmode on your output file handle encodes each of those characters a second time, giving C3 A2 for E2, C2 82 for 82 and C2 AC for AC.
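That second encoding step can be reproduced in isolation by printing the hand-encoded bytes through an encoding layer to an in-memory file handle (a sketch using a scalar ref as the output target):

```perl
use strict;
use warnings;

my $out = '';
open my $fh, '>:encoding(UTF-8)', \$out or die $!;
print {$fh} "\xe2\x82\xac";    # the hand-encoded bytes from the question
close $fh or die $!;

# each byte was re-encoded as if it were a code point
printf "%vX\n", $out;          # C3.A2.C2.82.C2.AC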

3
votes

You are building two different strings, so getting different results shouldn't be surprising.

You are performing what is called "double-encoding". You had a string that was already encoded using UTF-8, and you asked Perl (using binmode and print) to encode it a second time. That was a bug on your part.


The string literal "\x{20ac}" produces a one-character string (0x20ac).

$ perl -E'say length("\x{20ac}")'
1

When you print it to a handle with the :utf8 layer, you are instructing Perl to treat the string's characters as Unicode code points and encode them using UTF-8.

As requested, Perl prints the following encoded using UTF-8:
U+020AC EURO SIGN (€).

$ perl -E'binmode STDOUT, ":utf8"; print "\x{20ac}"' | od -t x1
0000000 e2 82 ac
0000003

$ perl -E'binmode STDOUT, ":utf8"; say "\x{20ac}"'
€

The string literal "\xe2\x82\xac" produces a three-character string (0xe2, 0x82, 0xac).

$ perl -E'say length("\xe2\x82\xac")'
3

("\xe2\x82\xac" is the same thing as "\x{e2}\x{82}\x{ac}".)
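That equivalence is easy to check directly:

```perl
use strict;
use warnings;
use feature 'say';

# Both literals build the same three-character string
say "\xe2\x82\xac" eq "\x{e2}\x{82}\x{ac}" ? "identical" : "different";
```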

When you print it to a handle with the :utf8 layer, you are instructing Perl to treat the string's characters as Unicode code points and encode them using UTF-8.

As requested, Perl prints the following encoded using UTF-8:
U+000E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (â),
U+00082 BREAK PERMITTED HERE and
U+000AC NOT SIGN (¬).

$ perl -E'binmode STDOUT, ":utf8"; print "\xe2\x82\xac"' | od -t x1
0000000 c3 a2 c2 82 c2 ac
0000006

$ perl -E'binmode STDOUT, ":utf8"; say "\xe2\x82\xac"'
�