Perl cgi and XML::Code double encoding issue

Question

I am using XML::Code to create some XML data from a GET parameter received through the CGI module. The webserver is Apache with its charset set to UTF-8, and the submitting form is on a page that starts with this header:

<!DOCTYPE html>
<html lang="en-GB">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The CGI looks like this:

use CGI;
use Encode;
use XML::Code;
binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");

my $cgi = CGI->new();
print $cgi->header(-type => "text/xml", -charset => "utf-8");
my $object = $cgi->param("object");
$object = decode("utf-8", utf8::upgrade($object));

my $content = XML::Code->new("formdata");
$content->version ("1.0");
$content->encoding ("UTF-8");

my $sub_content = XML::Code->new("object");
$sub_content->set_text($object);
$content->add_child($sub_content);

$sub_content = XML::Code->new("isutf");
$sub_content->set_text(utf8::is_utf8($object));
$content->add_child($sub_content);

print $content->code();

When I call the CGI with http://mydomain.com/cgi-bin/formdata.pl?object=ö the output (as copied from Firebug) is:

<?xml version="1.0" encoding="UTF-8"?>
<formdata>
    <object>Ã¶</object>
    <isutf>1</isutf>
</formdata>

Removing binmode(STDOUT, ":utf8") from the CGI gives me what I am looking for:

<?xml version="1.0" encoding="UTF-8"?>
<formdata>
    <object>ö</object>
    <isutf>1</isutf>
</formdata>

So I know how to work around the issue, but I thought I would be safe setting everything to UTF-8. If I am not, that means a lot more testing. Is this a bug in the Perl libraries or in my thinking?

Best, Marcus

chooban · Accepted Answer · 2012-10-07T13:11:44

I think that the following line:

$object = decode("utf-8", utf8::upgrade($object));

might not be helping. utf8::upgrade changes the string's internal representation in place and returns the number of octets needed to represent it as UTF-8, so decode is handed that number rather than the string. If you leave it as:

$object = decode("utf-8", $object);

then you might have more understandable behaviour.
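To see why the nested call is a problem, here is a minimal sketch that separates the two steps. The variable names ($raw, $copy, $count, $wrong, $right) are made up for illustration, and the literal octets stand in for whatever $cgi->param returns in your setup:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two raw octets, as they might arrive from the query string: UTF-8 for "ö".
my $raw = "\xC3\xB6";

# utf8::upgrade works in place on its argument and returns a count of
# octets, not the string itself.
my $copy  = $raw;
my $count = utf8::upgrade($copy);

# Feeding that return value to decode() decodes the number, not the text.
my $wrong = decode("UTF-8", $count);

# Decoding the octets themselves yields the single character U+00F6.
my $right = decode("UTF-8", $raw);

print "upgrade returned: $count\n";
print "decode of the return value: $wrong\n";
printf "decode of the octets: U+%04X, length %d\n", ord($right), length($right);
```

Code points are printed instead of the character itself so the result does not depend on how the terminal is configured.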

I think I've figured out a bit more with the help of this short script:

#! /usr/bin/perl -w
use Encode;
binmode( STDOUT, ":utf8" );
my $string = "\x{C3}\x{B6}";
print "$string\n";
my $decoded = decode( "UTF8", $string );
print "$decoded\n";

The output from that is:

Ã¶
ö

So here's what I believe is happening. The $string declaration above stands in for what you get back from your call to $cgi->param, that is, two bytes that represent ö in UTF-8. When the script first prints it, Perl has no indication that these bytes are UTF-8, but it knows that it must encode the string before printing (because of the binmode).

Perl's default behaviour is to treat a string of octets that has not been decoded as Latin-1 characters. So it takes the first byte, C3, looks up what it is in Latin-1 (Ã) and then prints the UTF-8 encoding of that character to STDOUT. The same happens for B6 (¶). You can double-check the bytes on Wikipedia.
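The byte-level effect can be checked with a short sketch. Here encode("UTF-8", ...) is used as a stand-in for what the :utf8 output layer does on print, which is an assumption made to keep the example testable; $octets, $written and $hex are illustrative names:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The undecoded octets; Perl sees two characters, U+00C3 and U+00B6.
my $octets = "\xC3\xB6";

# Encoding those two characters as UTF-8 turns each one into two bytes.
my $written = encode("UTF-8", $octets);

# Show the resulting bytes in hex.
my $hex = join " ", map { sprintf "%02X", ord } split //, $written;
print "$hex\n";    # prints "C3 83 C2 B6"
```

Four bytes go out where two came in, and a UTF-8 aware viewer renders them as Ã¶.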

However, the call to decode will interpret the bytes as UTF-8 and create a new string which consists of the single character ö. Don't think of strings as having an encoding: bytes coming in and going out need an encoding, but inside your program, once they've been correctly interpreted, they're just strings.

Now that Perl has interpreted those bytes and converted them into a character string (stored in whatever internal encoding it wishes), the next time you print it Perl knows to convert the character to UTF-8, and you get the correct output.
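Put together, the pattern is to decode once on the way in and encode once on the way out. A rough sketch, again with encode("UTF-8", ...) standing in for the :utf8 layer on STDOUT and with illustrative variable names:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as received from outside the program.
my $octets = "\xC3\xB6";

# Decode once at the boundary: two octets become one character.
my $char = decode("UTF-8", $octets);

# Encode once on output: the character becomes the same two octets again.
my $out = encode("UTF-8", $char);

printf "characters after decode: %d\n", length($char);
printf "bytes after encode: %s\n",
    join " ", map { sprintf "%02X", ord } split //, $out;
```

If the decode step is skipped, the output step still encodes, and that is where the double encoding comes from.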

Hope that helps you debug the CGI.
