3
votes

In a web app I maintain, I try to keep everything in UTF-8:

  • the Database (CHARSET=utf8)
  • the source files (use utf8; written in utf8)
  • the templates (for Template Toolkit, using ENCODING => utf8)
  • user input and output (charset=utf8 header in HTTP, binmode :utf8 for STDIN and STDOUT)

But I still need to use Encode::decode('UTF-8', $data) for data coming from the database, or it ends up double-encoded or otherwise broken.

Why is this? How can I get rid of this annoying extra step? Shouldn't there be a way to keep everything in UTF-8 all the time, without having to convert anything by hand?


3 Answers

3
votes

To get UTF-8 data properly from the database, you need to tell it so explicitly when connecting:

use DBI;
# mysql_enable_utf8: have DBD::mysql treat text from MySQL as UTF-8
my $dbh = DBI->connect( "dbi:mysql:dbname=$db;host=localhost",
       "user", "pwd", { mysql_enable_utf8 => 1 } );

As I asked in my question here, there are still some problems with it, but in most cases it works fine.

Answering the "why" part is much harder. As Denis pointed out, there was a pretty heavy thread about the "why" recently; maybe it helps you understand the related issues. I suggest using the utf8::all module to make UTF-8 handling much easier and cleaner.
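For illustration, here is a rough sketch of the kind of setup utf8::all is meant to save you from typing, written with core pragmas only. Treat the equivalence as an assumption on my part: the module does more than this (as far as I know it also decodes @ARGV, for example), so check its documentation before relying on it. The $greeting variable is just a made-up example value.

```perl
use strict;
use warnings;
use utf8;                               # string literals in this file are UTF-8
use open qw(:std :encoding(UTF-8));     # STDIN/STDOUT/STDERR and later open() calls default to UTF-8

# A character string, written with escapes so the file encoding does not matter.
my $greeting = "Gr\x{fc}\x{df}e";

# With the layer in place, print encodes the characters to UTF-8 on output.
print "$greeting\n";
print length($greeting), " characters\n";
```

With utf8::all installed from CPAN, the two pragma lines would be replaced by a single "use utf8::all;".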

1
votes

Internally, your database will presumably keep all data in a fixed-width, raw format, usually UCS-4 (i.e. raw strings of 32-bit integers holding one code point each). UTF-8 is an encoding, and encodings are only used when serializing data (e.g. to a file or a database). Deserializing, i.e. reading, means decoding the encoded data to retrieve the raw code point string.

Just because you happen to use the same encoding for all your serialization needs doesn't exempt you from decoding when loading and encoding when writing.
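A minimal sketch of that decode-on-read, encode-on-write cycle, using only the core Encode module. The variable names and the sample string are made up for illustration; the last step shows what happens when bytes that were never decoded get encoded again, which is the usual cause of the double encoding mentioned in the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Grüße" as a character string, written with escapes so the file encoding does not matter.
my $chars = "Gr\x{fc}\x{df}e";

# Writing: encode the characters into UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

# Reading: decode the UTF-8 bytes back into characters.
my $decoded = decode('UTF-8', $bytes);

# Skipping the decode and encoding again treats every byte as a character,
# so each non-ASCII byte is expanded a second time.
my $double = encode('UTF-8', $bytes);

printf "characters: %d, bytes: %d, double-encoded bytes: %d\n",
    length($chars), length($bytes), length($double);
print "round trip ok\n" if $decoded eq $chars;
```

The same five characters occupy 7 bytes once encoded and 11 bytes after being encoded twice, while the decoded string compares equal to the original.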