How to use Unicode in a Perl CGI param

Question

I have a Perl CGI script that accepts Unicode characters in one of its params.
The URL has the form

.../worker.pl?text="some_unicode_chars"&...

In the Perl script, I pass the $text variable to a shell script:

system "a.sh \"$text\" out_put_file";

If I hardcode the text in the Perl script, it works well. However, the output is garbled when $text comes from the web request via CGI:

my $q = CGI->new;  
my $text = $q->param('text');

I suspect the encoding is causing the problem. UTF-8 has caused me a lot of trouble. Can anyone please help?

DavidRR · Accepted Answer · 2013-12-06T13:19:42

Perhaps this will help. From Perl Programming/Unicode UTF-8:

By default, CGI.pm does not decode your form parameters. You can use the -utf8 pragma, which will treat (and decode) all parameters as UTF-8 strings, but this will fail if you have any binary file upload fields. A better solution involves overriding the param method: (example follows)

[Wrong - see Correction] Here's documentation for the utf8 pragma. Since uploading binary data does not appear to be a concern for you, use of the utf8 pragma appears to be the most straightforward approach.

Correction: Per the comment from @Slaven, do not confuse the general Perl utf8 pragma with the -utf8 pragma that has been defined for use with CGI.pm:

-utf8

This makes CGI.pm treat all parameters as UTF-8 strings. Use this with care, as it will interfere with the processing of binary uploads. It is better to manually select which fields are expected to return utf-8 strings and convert them using code like this:

use Encode;
my $arg = decode utf8=>param('foo');
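
Applied to the question, the manual approach might look roughly like the sketch below. It is only an illustration under stated assumptions: decode_text_param is a hypothetical helper name, the raw value is simulated with a byte string instead of a live $q->param('text') call so that the snippet runs outside a web server, and the call to a.sh (the script named in the question) is left as a comment.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn the raw UTF-8 bytes of one parameter
# into a Perl character string, dying on malformed input.
sub decode_text_param {
    my ($raw) = @_;
    return undef unless defined $raw;
    return decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC);
}

# Simulated raw value; in the CGI script this would be $q->param('text').
my $raw  = "Espa\xC3\xB1ol";          # 'ñ' arrives as the two bytes C3 B1
my $text = decode_text_param($raw);   # character string, safe for length, regexes, etc.

printf "bytes: %d, characters: %d\n", length($raw), length($text);

# Encode back to UTF-8 bytes before handing the value to another program.
my $bytes = encode('UTF-8', $text);

# The list form of system bypasses the shell, so no manual quoting is needed:
# system('a.sh', $bytes, 'out_put_file') == 0 or die "a.sh failed: $?";
```

Whether a.sh then produces the expected output also depends on the locale it runs under, which this sketch does not address.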

Follow Up: duleshi, you ask: "But I still don't understand the difference between decode in Encode and utf8::decode." How do the Encode and utf8 modules differ?

From the documentation for the utf8 pragma:

Note that this function does not handle arbitrary encodings. Therefore Encode is recommended for the general purposes; see also Encode.

Put another way, the Encode module works with many different encodings (including UTF-8), whereas the utf8 functions work only with the UTF-8 encoding.

Here is a Perl program that demonstrates the equivalence of the two approaches to encoding and decoding UTF-8. (Also see the live demo.)

#!/usr/bin/perl

use strict;
use warnings;
use utf8;  # allows 'ñ' to appear in the source code

use Encode;

my $word = "Español";  # the 'ñ' is permitted because of the 'use utf8' pragma

# Convert the string to its UTF-8 equivalent.
my $utf8_word = Encode::encode("UTF-8", $word);

# Use 'utf8::decode' to convert the string back to internal form.
my $word_again_via_utf8 = $utf8_word;
utf8::decode($word_again_via_utf8);  # converts in-place

# Use 'Encode::decode' to convert the string back to internal form.
my $word_again_via_Encode = Encode::decode("UTF-8", $utf8_word);

# Do the two conversion methods produce the same result?
# Prints 'Yes'.
print $word_again_via_utf8 eq $word_again_via_Encode ? "Yes\n" : "No\n";

# Do we get back the original internal string after converting both ways?
# Prints 'Yes'.
print $word eq $word_again_via_Encode ? "Yes\n" : "No\n";
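
The other half of the comparison, that Encode also handles encodings other than UTF-8, can be sketched in the same way. This is an illustrative example rather than part of the original answer; the ISO-8859-1 input bytes are an assumption chosen for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 'Español' as ISO-8859-1 bytes: 'ñ' is the single byte 0xF1.
my $latin1_bytes = "Espa\xF1ol";

# Encode can be told which encoding the bytes are in.
my $via_encode = decode('ISO-8859-1', $latin1_bytes);
printf "Encode::decode gives U+%04X for the fifth character\n",
    ord(substr($via_encode, 4, 1));

# utf8::decode only understands UTF-8; these bytes are not well-formed
# UTF-8, so it reports failure by returning false.
my $copy = $latin1_bytes;
my $ok   = utf8::decode($copy);
print $ok ? "utf8::decode succeeded\n" : "utf8::decode failed\n";
```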
