2
votes

I have a Perl CGI script that accepts Unicode characters in one of its parameters.
The URL has the form

.../worker.pl?text="some_unicode_chars"&...

In the Perl script, I pass the $text variable to a shell script:

system "a.sh \"$text\" out_put_file"; 

If I hardcode the text in the Perl script, it works well. However, the output is garbled when $text comes from the web request via CGI:

my $q = CGI->new;  
my $text = $q->param('text'); 

I suspect the encoding is causing the problem. UTF-8 has given me a lot of trouble. Can anyone help?

2 Answers

3
votes

Perhaps this will help. From Perl Programming/Unicode UTF-8:

By default, CGI.pm does not decode your form parameters. You can use the -utf8 pragma, which will treat (and decode) all parameters as UTF-8 strings, but this will fail if you have any binary file upload fields. A better solution involves overriding the param method: (example follows)

[Wrong - see Correction] Here's documentation for the utf8 pragma. Since uploading binary data does not appear to be a concern for you, use of the utf8 pragma appears to be the most straightforward approach.

Correction: Per the comment from @Slaven, do not confuse the general Perl utf8 pragma with the -utf8 pragma that has been defined for use with CGI.pm:

-utf8

This makes CGI.pm treat all parameters as UTF-8 strings. Use this with care, as it will interfere with the processing of binary uploads. It is better to manually select which fields are expected to return utf-8 strings and convert them using code like this:

use Encode;
my $arg = decode utf8=>param('foo');
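
To make the effect of decoding concrete, here is a minimal sketch. It does not use CGI.pm at all: the variable $raw is a hypothetical stand-in for what param('text') would return without -utf8, namely the raw UTF-8 bytes sent by the browser.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for $q->param('text') without -utf8:
# the raw UTF-8 bytes of "Español" (the 'ñ' is the two bytes C3 B1).
my $raw = "Espa\xC3\xB1ol";

# Decode the bytes into a Perl character string.
my $text = decode('UTF-8', $raw);

printf "raw length:     %d\n", length $raw;     # counts bytes
printf "decoded length: %d\n", length $text;    # counts characters

# Encode again before the value leaves Perl (file, pipe, external command).
my $bytes = encode('UTF-8', $text);
print "round trip preserved the bytes\n" if $bytes eq $raw;
```

The two lengths differ because the undecoded value is counted in bytes, while the decoded value is counted in characters.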

Follow Up: duleshi, you ask: "But I still don't understand the difference between decode in Encode and utf8::decode." How do the Encode and utf8 modules differ?

From the documentation for the utf8 pragma:

Note that this function does not handle arbitrary encodings. Therefore Encode is recommended for the general purposes; see also Encode.

Put another way, the Encode module works with many different encodings (including UTF-8), whereas the utf8 functions work only with the UTF-8 encoding.
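
As a rough illustration of that difference, the sketch below decodes the same word from two different byte encodings with Encode, then shows utf8::decode rejecting the non-UTF-8 input. The sample byte strings are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "Español" as ISO-8859-1 bytes (ñ = F1) and as UTF-8 bytes (ñ = C3 B1).
my $latin1_bytes = "Espa\xF1ol";
my $utf8_bytes   = "Espa\xC3\xB1ol";

# Encode::decode takes the encoding name as its first argument.
my $from_latin1 = decode('ISO-8859-1', $latin1_bytes);
my $from_utf8   = decode('UTF-8',      $utf8_bytes);

print "same characters\n" if $from_latin1 eq $from_utf8;

# utf8::decode only understands UTF-8. It works in place and
# returns false when the input is not well-formed UTF-8.
my $copy = $latin1_bytes;
my $ok   = utf8::decode($copy);
print "utf8::decode rejected the ISO-8859-1 bytes\n" unless $ok;
```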

Here is a Perl program that demonstrates the equivalence of the two approaches to encoding and decoding UTF-8. (Also see the live demo.)

#!/usr/bin/perl

use strict;
use warnings;
use utf8;  # allows 'ñ' to appear in the source code

use Encode;

my $word = "Español";  # the 'ñ' is permitted because of the 'use utf8' pragma

# Convert the string to its UTF-8 equivalent.
my $utf8_word = Encode::encode("UTF-8", $word);

# Use 'utf8::decode' to convert the string back to internal form.
my $word_again_via_utf8 = $utf8_word;
utf8::decode($word_again_via_utf8);  # converts in-place

# Use 'Encode::decode' to convert the string back to internal form.
my $word_again_via_Encode = Encode::decode("UTF-8", $utf8_word);

# Do the two conversion methods produce the same result?
# Prints 'Yes'.
print $word_again_via_utf8 eq $word_again_via_Encode ? "Yes\n" : "No\n";

# Do we get back the original internal string after converting both ways?
# Prints 'Yes'.
print $word eq $word_again_via_Encode ? "Yes\n" : "No\n";
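
Applied to the system call in the question, one possible approach is to decode the parameter, encode it explicitly to UTF-8 bytes, and pass it with the list form of system so that no shell re-parses the text. This is only a sketch: $raw stands in for $q->param('text'), and the running Perl interpreter ($^X) stands in for a.sh so the example can run without the real script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for $q->param('text'): raw UTF-8 bytes.
my $raw  = "Espa\xC3\xB1ol";
my $text = decode('UTF-8', $raw);

# ... work with $text as a character string here ...

# Encode explicitly before the value leaves Perl.
my $arg = encode('UTF-8', $text);

# List-form system: each element becomes one argument and no shell is
# involved, so quotes or spaces in $arg need no escaping.
# $^X (this Perl) is a placeholder for the real command, e.g.
#   system('a.sh', $arg, 'out_put_file');
my @cmd    = ($^X, '-e', 'print length($ARGV[0]), " bytes received\n"', $arg);
my $status = system(@cmd);
warn "command failed with status $status\n" if $status != 0;
```

Besides keeping the bytes intact, the list form avoids the quoting problems of interpolating user input into a single command string.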
1
votes

If you're passing UTF-8 data around in the parameter list, then you definitely want to URI-encode it using the URI::Escape module. This converts any extended characters to percent-escaped values, which are easily printable and readable. On the receiving end you will then need to URI-decode them before continuing.
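
A minimal sketch of that round trip, assuming the URI::Escape module from CPAN is installed; the sample word is only an illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use URI::Escape qw(uri_escape_utf8 uri_unescape);

# A character string containing 'ñ' (U+00F1).
my $text = "Espa\x{F1}ol";

# Sending side: encode as UTF-8, then percent-escape the bytes.
my $escaped = uri_escape_utf8($text);
print "escaped: $escaped\n";

# Receiving side: unescaping yields UTF-8 bytes, which still
# need to be decoded into characters.
my $bytes   = uri_unescape($escaped);
my $decoded = decode('UTF-8', $bytes);

print "round trip ok\n" if $decoded eq $text;
```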