There are two stages to processing Unicode text. The first is "how can I input it and output it without losing information". The second is "how do I treat text according to local language conventions".
tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.
This post aims to solve that first problem.
When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes away there. If you say print $str, it just blits those bytes out to your terminal, which is probably set to assume everything that is written to it is UTF-8, and your text shows up.
Marvelous.
Except, it's not. If you try to treat the data as text, you'll see that Something Bad is happening. You need go no further than length to see that what Perl thinks about your string and what you think about your string disagree. Write a one-liner like:
perl -E 'while(<>){ chomp; say length }'
and type in 文字化け, and you get 12... not the correct answer, 4.
That's because Perl assumes your string is not text. You have to tell it that it's text before it will give you the right answer.
That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode (or use Encode qw(decode), of course). That function takes some string from the outside world (what we'll call "octets", a fancy way of saying "8-bit bytes"), and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is the Perl scalar containing the text. (There is also Encode::decode_utf8, which assumes UTF-8 for the encoding.)
If we rewrite our one-liner:
perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'
We type in 文字化け and get "4" as the result. Success.
That, right there, is the solution to 99% of Unicode problems in Perl.
The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.
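Here is the same idea as a small script rather than a one-liner. It's only a sketch: the in-memory filehandle and the names $file_octets and @report are my own stand-ins for whatever real input you have, and the octets are written as escapes so the result doesn't depend on how you save the file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for a UTF-8 file: the octets for 文字化け plus a newline,
# written as escapes so nothing depends on how this source file is saved.
my $file_octets = "\xE6\x96\x87\xE5\xAD\x97\xE5\x8C\x96\xE3\x81\x91\n";

open my $in, '<', \$file_octets or die "cannot open in-memory file: $!";

my @report;
while (my $line = <$in>) {
    chomp $line;
    my $text = decode('UTF-8', $line);   # octets in, characters out
    push @report, sprintf "octets: %d, characters: %d",
        length($line), length($text);
}
close $in;

print "$_\n" for @report;   # octets: 12, characters: 4
```

The decode call sits right where the data enters the program, so everything after it works with characters.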
The other half of the problem is getting data out of your program. That's easy too; you just say use Encode qw(encode), decide what encoding your data will be in (UTF-8 to terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data) instead of just outputting $data.
This operation converts Perl's characters, which is what your program operates on, to octets that can be used by the outside world. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, otherwise the results are undefined.
To summarize: encode all outputs and decode all inputs.
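A minimal sketch of the full round trip, assuming the input is UTF-8 and the output should be UTF-8 as well. The sample word and the variable names are mine. It also shows what goes wrong when you skip the decode: uc on raw octets only touches the ASCII letters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $input = "na\xC3\xAFve";              # UTF-8 octets for "naïve"

# Wrong: uc() on undecoded octets only touches the ASCII letters.
my $wrong = uc $input;                   # "NA\xC3\xAFVE", the ï stays lowercase

# Right: decode, work on characters, encode on the way out.
my $text   = decode('UTF-8', $input);
my $output = encode('UTF-8', uc $text);  # "NA\xC3\x8FVE", a proper Ï

# Show the resulting octets in hex so the difference is visible anywhere.
print join(' ', map { sprintf '%02X', ord } split //, $wrong),  "\n";   # 4E 41 C3 AF 56 45
print join(' ', map { sprintf '%02X', ord } split //, $output), "\n";   # 4E 41 C3 8F 56 45
```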
Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give you your result back as text, provided you call the right method on the result (that happens to be decoded_content, not content, which is just the octet stream that it got from the server). Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has stored text in your database in some encoding other than UTF-8... well... it's not going to be handled correctly until you write code to handle it correctly.
Outputting data is usually easier, but if you see "Wide character in print", then you know you're messing up the encoding somewhere. That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program appears to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode!
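As a rough illustration, assuming your terminal wants UTF-8: the string below contains U+263A, which doesn't fit in one octet, so printing it directly is what triggers the warning. The variable names are made up for this sketch.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string containing U+263A (WHITE SMILING FACE).
my $text = "\x{263A} smile";

# print $text;   # this is the line that would warn "Wide character in print"

# Encode explicitly, then print the octets.
my $octets = encode('UTF-8', $text);
print $octets, "\n";

printf "%d characters became %d octets\n", length($text), length($octets);
# 7 characters became 9 octets
```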
The second problem is UTF-8 encoded source code. Unless you say use utf8 at the top of each file, Perl will not assume that your source code is UTF-8. This means that each time you say something like my $var = 'ほげ', you're injecting garbage into your program that will totally break everything horribly. You don't have to "use utf8", but if you don't, you must not use any non-ASCII characters in your program.
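You can see the difference in a single file, because the utf8 pragma is lexically scoped. This sketch assumes you save the file as UTF-8; the two blocks and the variable names are just for illustration.

```perl
use strict;
use warnings;

my ($as_octets, $as_chars);

{
    no utf8;    # the default: the literal is taken as raw octets
    $as_octets = length 'ほげ';
}

{
    use utf8;   # the source is UTF-8, so the literal is two characters
    $as_chars = length 'ほげ';
}

print "without use utf8: $as_octets, with use utf8: $as_chars\n";
# without use utf8: 6, with use utf8: 2
```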
The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "文字化け", we got 12. Perl assumed that we were operating on the Latin-1 string "æååã" (which is 12 characters, some of which are non-printing).
This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode input: if you don't do it, Perl will, and it might do it wrong.
People run into trouble where half their data is a proper character string, and some is still binary. Perl will interpret the part that's still binary as though it's Latin-1 text and then combine it with the correct character data. This will make it look like handling your characters correctly broke your program, but in reality, you just haven't fixed it enough.
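Here's a small sketch of that failure, with made-up variable names. One string is still raw UTF-8 octets, the other is a real character string, and concatenating them makes Perl read the octets as Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "caf\xC3\xA9";   # UTF-8 octets for "café", never decoded
my $chars  = "\x{263A}";      # a proper character string

# Half fixed: the octets get treated as Latin-1 characters when combined.
my $mixed     = $octets . $chars;
my $mixed_out = encode('UTF-8', $mixed);
# "caf\xC3\x83\xC2\xA9\xE2\x98\xBA", the é has turned into "Ã©"

# Fully fixed: decode first, then combine, then encode.
my $fixed     = decode('UTF-8', $octets) . $chars;
my $fixed_out = encode('UTF-8', $fixed);
# "caf\xC3\xA9\xE2\x98\xBA"

printf "mixed: %d octets, fixed: %d octets\n",
    length($mixed_out), length($fixed_out);
# mixed: 10 octets, fixed: 8 octets
```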
Here's an example: you have a program that reads a UTF-8-encoded text file, you tack on a Unicode PILE OF POO to each line, and you print it out. You write it like:
use feature 'say';   # say is not enabled by default
while(<>){
    chomp;
    say "$_ 💩";
}
And then run it on some UTF-8 encoded data, like:
perl poo.pl input-data.txt
It prints the UTF-8 data with a poo at the end of each line. Perfect, my program works!
But nope, you're just doing binary concatenation. You're reading octets from the file, removing a \n with chomp, and then tacking on the bytes in the UTF-8 representation of the PILE OF POO character. When you revise your program to decode the data from the file and encode the output, you'll notice that you get garbage ("ð©") instead of the poo. This will lead you to believe that decoding the input file is the wrong thing to do. It's not.
The problem is that the poo is being implicitly upgraded as Latin-1. If you use utf8 to make the literal text instead of binary, then it will work again!
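Put together, the repaired version might look something like this. It's a sketch, not the one true way: add_poo is a hypothetical helper I made up so the three fixes (use utf8, decode, encode) sit in one place, and it assumes both the input and the output are UTF-8.

```perl
use strict;
use warnings;
use utf8;                          # source is UTF-8, so '💩' is one character
use Encode qw(decode encode);

# Hypothetical helper: takes one line of UTF-8 octets, returns UTF-8 octets.
sub add_poo {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);   # input: octets -> characters
    return encode('UTF-8', "$text 💩");     # output: characters -> octets
}

# In poo.pl this call would sit inside the while(<>) loop, after chomp.
# The argument is the UTF-8 octets for ほげ.
print add_poo("\xE3\x81\xBB\xE3\x81\x92"), "\n";
```

Remove any one of the three pieces and you are back to either garbage or a "Wide character" warning.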
(That's the number one problem I see when helping people with Unicode. They did part right and that broke their program. That's what's sad about undefined results: you can have a working program for a long time, but when you start to repair it, it breaks. Don't worry; if you are adding encode/decode statements to your program and it breaks, it just means you have more work to do. Next time, when you design with Unicode in mind from the beginning, it will be much easier!)
That's really all you need to know about Perl and Unicode. If you tell Perl what your data is, it has the best Unicode support among all popular programming languages. If you assume it will magically know what sort of text you are feeding it, though, then you're going to trash your data irrevocably. Just because your program works today on your UTF-8 terminal doesn't mean it will work tomorrow on a UTF-16 encoded file. So make it safe now, and save yourself the headache of trashing your users' data!
The easy part of handling Unicode is encoding output and decoding input. The hard part is finding all your input and output, and determining which encoding it is. But that's why you get the big bucks :)