15
votes

The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this.

use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';

print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";

The output of this script, however, disagrees with the manpage:

ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35

It seems to me that length() and bytes::length() return the same value for both ASCII and Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode. Does that mean length() automatically handles Unicode strings properly?

Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example - it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) in an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.
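For the Content-Length case specifically, here is a minimal sketch of one way to get a byte count without depending on use bytes: encode the text to UTF-8 explicitly and take the length of the result. The helper name content_length_for is hypothetical, and the sketch assumes the body will be sent as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the original post): returns the value to
# put in Content-Length for a text body that will be sent as UTF-8.
sub content_length_for {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);   # characters -> bytes
    return length $octets;                 # length of a byte string is a byte count
}

my $body = "L\x{f8}r\x{eb}m";              # "Lørëm", written with escapes

printf "characters: %d\n",     length $body;               # 5
printf "Content-Length: %d\n", content_length_for($body);  # 7
```

The escapes keep the example independent of how the source file itself is saved.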

4
I don't see why you say that length() handles unicode strings properly. In your example length() gives the same result as bytes::length(), that is the number of bytes, not the number of characters (which would be proper). - Inshallah
In other words, length($unicode) is interpreting the string as ASCII, not as unicode. - Inshallah
You're absolutely correct! I had completely overlooked this fact: in my program, I'm using length() to set the Content-Length header in an HTTP message, which needs to be in bytes. After reading the length() docs, I was expecting that function to return something incorrect, but it is in fact exactly what I want when Perl is in use bytes mode: the length of the Unicode string in bytes, rather than characters. - Drew Stephens
Why do you want the length of a Unicode string? What are you using it for? - brian d foy

4 Answers

24
votes

If your scripts are encoded in UTF-8, then please use the utf8 pragma. The bytes pragma on the other hand will force byte semantics on length, even if the string is UTF-8. Both work in the current lexical scope.

$ascii = 'Lorem ipsum dolor sit amet';
{
    use utf8;
    $unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';

no bytes; # default, can be omitted
print "Character semantics:\n";

print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";

print "----\n";

use bytes;
print "Byte semantics:\n";

print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";

This outputs:

Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35
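
To illustrate the point about lexical scope, here is a small sketch (variable names are my own) in which use bytes is confined to a single block. It uses \x{...} escapes so the result does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

my $text = "\x{24d5}\x{24de}\x{24de}";   # three characters, all above U+00FF

my $chars = length $text;                # character semantics (the default)

my $bytes;
{
    use bytes;                           # byte semantics only inside this block
    $bytes = length $text;
}

my $chars_again = length $text;          # back to character semantics

print "chars: $chars, bytes: $bytes, chars again: $chars_again\n";
# prints "chars: 3, bytes: 9, chars again: 3"
```

Note that, as far as I know, recent versions of the bytes documentation discourage using the pragma for anything other than debugging, so an explicit encode step may be the safer choice in production code.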
4
votes

The purpose of the bytes pragma is to replace the length function (and several other string-related functions) in the current scope. So every call to length in the scope of use bytes is a call to the length that bytes provides. This is more in line with what you were trying to do:

#!/usr/bin/perl

use strict;
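use warnings;

# Encode output as UTF-8 so printing $utf8 does not trigger a "Wide character" warning
binmode STDOUT, ':encoding(UTF-8)';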

sub bytes($) {
    use bytes;
    return length shift;
}

my $ascii = "foo"; #really UTF-8, but everything is in the ASCII range
my $utf8  = "\x{24d5}\x{24de}\x{24de}";

print "[$ascii] characters: ", length $ascii, "\n",
    "[$ascii] bytes     : ", bytes $ascii, "\n",
    "[$utf8] characters: ", length $utf8, "\n",
    "[$utf8] bytes     : ", bytes $utf8, "\n";

Another subtle flaw in your reasoning is the assumption that there is such a thing as Unicode bytes. Unicode is an enumeration of characters. It says, for instance, that U+24D5 is ⓕ (CIRCLED LATIN SMALL LETTER F). What Unicode does not specify is how many bytes a character takes up. That is left to the encodings: UTF-8 says this character takes up 3 bytes, UTF-16 says it takes up 2 bytes, UTF-32 says it takes 4 bytes, etc. Here is a comparison of Unicode encodings. Perl stores strings internally as UTF-8 when they contain characters that do not fit in a single byte. UTF-8 has the benefit of being identical to ASCII for the first 128 code points (U+0000 through U+007F).
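
The byte counts above can be checked with the Encode module. This is a small sketch, not part of the original answer; it uses the big-endian variants UTF-16BE and UTF-32BE so that no byte order mark is added to the output.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{24d5}";    # CIRCLED LATIN SMALL LETTER F

# Encode the same single character three ways and count the resulting bytes.
my %size;
for my $encoding (qw(UTF-8 UTF-16BE UTF-32BE)) {
    $size{$encoding} = length(encode($encoding, $char));
    print "$encoding: $size{$encoding} bytes\n";
}
# prints:
# UTF-8: 3 bytes
# UTF-16BE: 2 bytes
# UTF-32BE: 4 bytes
```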

2
votes

I found that it is possible to use the Encode module to influence how length works.

If $string is a UTF-8 encoded string:

Encode::_utf8_on($string); # length will report the number of code points after this.

Encode::_utf8_off($string); # length will report the number of bytes in the string after this.
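
As an alternative to flipping the internal flag by hand, the same two lengths can be obtained with the public decode and encode functions of Encode. The sketch below assumes the input really is valid UTF-8; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for U+24D5 U+24DE U+24DE, written out as raw octets
my $octets = "\xe2\x93\x95\xe2\x93\x9e\xe2\x93\x9e";

my $text = decode('UTF-8', $octets);       # bytes -> characters
my $back = encode('UTF-8', $text);         # characters -> bytes

print "bytes: ",       length($octets), "\n";   # 9
print "characters: ",  length($text),   "\n";   # 3
print "bytes again: ", length($back),   "\n";   # 9
```

Unlike the underscore-prefixed functions, decode checks the input rather than simply trusting that the bytes are well-formed.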

0
votes

There’s a fair bit of problematic commentary here.

Perl doesn’t know—and doesn’t care—which strings are “Unicode” and which aren’t. All it knows is the code points that make up the string.

Peeking at Perl’s internal UTF8 flag indicates you likely have the wrong idea about Perl strings. A “UTF-8 encoded string”—that is, the result of an encode operation like utf8::encode—usually does NOT have that flag set, for example.

There are some interfaces where that abstraction leaks, and strings with the internal UTF8 flag set DO behave differently from the same set of code points without that flag (that is, after utf8::downgrade). It’s unwise to rely on these behaviours since Perl’s own maintainers regard them as bugs. Most are fixed by the “unicode_strings” and “unicode_eval” features, and the rest by Sys::Binmode from CPAN.
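
A short sketch of the points above, using only the built-in utf8:: functions (available without loading any module). It shows that the internal flag does not change what length or eq see, and that an encoded string does not carry the flag. Variable names are my own.

```perl
use strict;
use warnings;

my $plain    = "caf\x{e9}";     # four code points, stored one byte each
my $upgraded = $plain;
utf8::upgrade($upgraded);       # same code points, internal UTF8 flag now set

printf "flag: %d vs %d\n",
    (utf8::is_utf8($plain) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);   # 0 vs 1
printf "length: %d vs %d\n", length($plain), length($upgraded);            # 4 vs 4
printf "equal: %s\n", ($plain eq $upgraded ? "yes" : "no");                # yes

my $encoded = $upgraded;
utf8::encode($encoded);         # now a UTF-8 encoded byte string, flag off
printf "encoded length: %d, flag: %d\n",
    length($encoded), (utf8::is_utf8($encoded) ? 1 : 0);                   # 5, 0
```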