How do I find the length of a Unicode string in Perl?

Question

The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this.

use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';

print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";

The output of this script, however, disagrees with the manpage:

ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35

It seems to me that length() and bytes::length() return the same value for both ASCII and Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode. Does that mean length() automatically handles Unicode strings properly?

Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example: it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) in an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.

I don't see why you say that length() handles unicode strings properly. In your example length() gives the same result as bytes::length(), that is the number of bytes, not the number of characters (which would be proper). — Inshallah
In other words, length($unicode) is interpreting the string as ASCII, not as unicode. — Inshallah
You're absolutely correct! I had completely overlooked this fact—in my program, I'm using length() to set the Content-Length header in an HTTP message, which needs to be in bytes. After reading the length() docs, I was expecting that function to return something incorrect, but it is in fact exactly what I want when Perl is in use bytes mode: the length of the Unicode string in bytes, rather than characters. — Drew Stephens
Why do you want the length of a Unicode string? What are you using it for? — brian d foy

Inshallah · Accepted Answer · 2009-08-25T07:48:37

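If your scripts are encoded in UTF-8, use the utf8 pragma so that Perl treats the string literals in the source as characters rather than raw bytes. The bytes pragma, on the other hand, forces byte semantics on length, even if the string is UTF-8. Both pragmas are lexically scoped: they apply only within the enclosing block or file.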

$ascii = 'Lorem ipsum dolor sit amet';
{
    use utf8;
    $unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';

no bytes; # default, can be omitted
print "Character semantics:\n";

print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";

print "----\n";

use bytes;
print "Byte semantics:\n";

print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";

This outputs:

Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35
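
For the Content-Length case mentioned in the question, one way to get both counts without switching pragmas is to keep the string as characters and encode it explicitly before measuring. The following is a minimal sketch, not the only approach. It assumes the script is saved as UTF-8 with precomposed accented characters and uses the core Encode module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                 # assumption: this source file is saved as UTF-8
use Encode qw(encode);    # Encode ships with core Perl

my $text = 'Lørëm ípsüm dölör sît åmét';

# On a character string, length() counts characters.
my $chars = length($text);

# Encode to UTF-8 octets; length() on the result counts bytes.
my $octets = encode('UTF-8', $text);
my $bytes  = length($octets);

print "Characters: $chars\n";
print "Bytes: $bytes\n";

# The byte count is the value an HTTP Content-Length header needs
# when the body is sent as UTF-8.
print "Content-Length: $bytes\n";
```

Under those assumptions this should report 26 characters and 35 bytes, matching the figures above. Encoding explicitly makes the byte count depend on the chosen output encoding rather than on how Perl happens to store the string internally.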
