If you want a string that consists of the three bytes E2 82 AC, you can declare it like this:
my $bytes = "\xE2\x82\xAC";
The \xXX form in a double-quoted string takes at most two hex digits and represents one byte.
The string above contains 3 bytes. If we pass the string to the length function it will return 3:
say 'Length of $bytes is: ' . length($bytes); # 3
Perl has no way of knowing whether those three bytes are intended to represent the Euro symbol. They could equally be a three-byte sequence from inside a JPEG file, or a ZIP file, or an SSL-encrypted TCP data stream traversing a network. Perl doesn't know or care - it's just three bytes.
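As a rough illustration of the "just three bytes" point, the sketch below dumps each element of the string as hex. The $hex variable and the hex-dump approach are illustrative additions, not part of the original answer.

```perl
use strict;
use warnings;
use feature 'say';

my $bytes = "\xE2\x82\xAC";

# Format each byte of the string as two uppercase hex digits.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

say "Bytes: $hex";                  # Bytes: E2 82 AC
say 'Length: ' . length($bytes);    # Length: 3
```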
If you actually want a string of characters (rather than bytes) then you need to provide the character data in a way that allows Perl to use its internal representation of Unicode characters to store them in memory. One way is to provide the non-ASCII characters in UTF-8 form in the source code. If you're doing this you'll need to say use utf8 at the top of your script to tell the Perl interpreter that the source code (including its string literals) is encoded as UTF-8:
use utf8;
my $euro_1 = "€";
Alternatively you can use the form \x{X...} with 1-6 hex digits representing the Unicode code point number. This will declare an identical string:
my $euro_2 = "\x{20ac}";
Each of these strings contains a multi-byte representation of the euro character in Perl's internal encoding. Perl knows the strings are character strings so the length function will return 1 (for 1 character) in each case:
say 'Length of $euro_1 is: ' . length($euro_1); # 1
say 'Length of $euro_2 is: ' . length($euro_2); # 1
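To make the "identical string" claim concrete, here is a minimal sketch that builds the character two ways and compares the results. It uses chr(0x20AC) in place of the literal "€" so that it does not depend on how the script file is saved; the variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $euro_a = "\x{20ac}";     # code point written as a string escape
my $euro_b = chr(0x20AC);    # the same code point built with chr

say $euro_a eq $euro_b ? 'same string' : 'different';    # same string
say sprintf('U+%04X', ord($euro_a));                     # U+20AC
say length($euro_a);                                     # 1
```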
The defining feature of Perl's internal representation of character strings is that it is for use inside Perl. If you want to write the data out to a file or a socket, you'll need to encode the character string to a sequence of bytes:
use Encode qw(encode);
say encode('UTF-8', $euro_1);
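A quick way to see what encode produces is to compare its result with the 3-byte string from the start of this answer. This is only a sketch; the $encoded variable is an illustrative name.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(encode);

my $euro    = "\x{20ac}";
my $encoded = encode('UTF-8', $euro);

say length($euro);       # 1 (one character)
say length($encoded);    # 3 (three bytes)
say $encoded eq "\xE2\x82\xAC" ? 'match' : 'no match';    # match
```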
It's also possible to use binmode or an argument to open to say that any string written to a particular filehandle should be encoded to a specific encoding.
binmode(STDOUT, ':encoding(utf-8)');
say $euro_1;
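The open variant can be sketched as follows. To keep the example self-contained it writes to an in-memory buffer rather than a real file, which is an assumption made for illustration; with a real file you would pass a filename instead of \$buffer.

```perl
use strict;
use warnings;
use feature 'say';

my $euro = "\x{20ac}";

# Open an in-memory filehandle with an encoding layer attached.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "Cannot open: $!";
print {$fh} $euro;
close($fh) or die "Cannot close: $!";

# The buffer now holds the encoded bytes, not the character.
say length($buffer);               # 3
say uc(unpack('H*', $buffer));     # E282AC
```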
This will only work correctly for character strings. If we took our original 3-byte string $bytes and used either encode or IO layers, we would end up with garbage, because Perl would treat each byte as a separate character and encode it as UTF-8. So \xE2 would be output as \xC3\xA2, \x82 would be output as \xC2\x82 and so on.
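The double-encoding effect is easy to reproduce. The following sketch encodes the byte string a second time and dumps the result as hex; $double is an illustrative name.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(encode);

my $bytes = "\xE2\x82\xAC";

# Each byte is treated as a code point below U+0100 and encoded separately.
my $double = encode('UTF-8', $bytes);

say length($double);               # 6
say uc(unpack('H*', $double));     # C3A2C282C2AC
```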
However, we can use the decode function from the Encode module to convert the 3-byte $bytes string into a single-character string in Perl's internal character representation:
use Encode qw(decode);
my $bytes = "\xE2\x82\xAC";
my $euro_3 = decode('UTF-8', $bytes);
say 'Length of $euro_3 is ' . length($euro_3); # 1
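If the bytes come from an untrusted source, you may want decode to fail loudly on malformed input. The sketch below assumes you want strict checking and uses Encode's FB_CROAK and LEAVE_SRC constants; the $check, $truncated and $ok names are illustrative.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(decode encode);

my $bytes = "\xE2\x82\xAC";

# Croak on malformed input, and leave the source string unmodified.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $euro = decode('UTF-8', $bytes, $check);
say length($euro);                                                     # 1
say encode('UTF-8', $euro) eq $bytes ? 'round trip ok' : 'mismatch';   # round trip ok

# A truncated sequence (the last byte is missing) is rejected.
my $truncated = "\xE2\x82";
my $ok = eval { decode('UTF-8', $truncated, $check); 1 };
say $ok ? 'decoded' : 'invalid UTF-8';                                 # invalid UTF-8
```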
One minor nitpick: In your original question you stated that 20AC is the UTF-16 representation of the euro symbol. In fact there are two different UTF-16 representations: UTF-16BE and UTF-16LE, with the latter using the opposite byte order: AC20.
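The two byte orders can be checked with Encode directly. This is a small sketch; $be and $le are illustrative names.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(encode);

my $euro = "\x{20ac}";

# Encode the same character in both byte orders and show the bytes as hex.
my $be = uc(unpack('H*', encode('UTF-16BE', $euro)));
my $le = uc(unpack('H*', encode('UTF-16LE', $euro)));

say "UTF-16BE: $be";    # UTF-16BE: 20AC
say "UTF-16LE: $le";    # UTF-16LE: AC20
```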
0xe282ac in UTF-8. That is purely how it's encoded as a bytestring. – Phylogenesis
Encode for how to decode a UTF-8 byte string into a Perl string – Håkon Hægland
use utf8 turns on UTF-8 in the source, i.e. you can write print "I owe you 160€\n" (you need to save the script as UTF-8). – choroba
"\xe2" is 100% equivalent to "\x{e2}". – ikegami