My Perl script and data input file are in BIG5 Chinese encoding.

The string data contains HTML entities, e.g. for Japanese characters.

The result displays perfectly when viewed in a browser.

But for further data manipulation, I need to convert everything into UTF-8.

For example:

From BIG5 encoding

一&#12392;三

To UTF-8 encoding

一と三

Here's the code I've tried:

#!/usr/local/bin/perl

use Encode qw/encode decode/;
use HTML::Entities;

print "Content-type: text/html\n\n";

$str = "&#12392;";
$str = encode('utf8', decode("big5",$str));
print "$str\n";
decode_entities($str);
print "$str\n";

$str2 = "一&#12392;三";
$str2 = encode('utf8', decode("big5",$str2));
print "$str2\n";
decode_entities($str2); # where the issue is
print "$str2\n";

Here's the result after running the above code.

&#12392;
と
一&#12392;三
ä¸とä¸

Please note the script itself is also saved as BIG5 encoding.

After decode_entities($str2), it seems that the UTF-8 bytes of the Chinese characters get decoded as well, and that is what causes the issue.

How do I fix this? Or how can I limit decode_entities() so that it applies only to the &xxxxx; pattern?

1 Answer

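The problem is that you mix decode_entities, which produces a character string (utf8::is_utf8 returns true), with a raw byte string (utf8::is_utf8 returns false) whose octets merely happen to be valid UTF-8. When decode_entities inserts the character for the entity, the existing octets are treated as individual Latin-1 characters, so the Chinese text ends up double-encoded on output. Instead, you should work consistently with either byte strings or character strings.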
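To see the mismatch in isolation, here is a minimal sketch. It assumes Encode and HTML::Entities are installed, and it writes the input as byte escapes so the result does not depend on how the script file is saved. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(decode_entities);

# UTF-8 octets of U+4E00 followed by a numeric entity for U+3068.
my $mixed = "\xE4\xB8\x80&#12392;";

# Decoding the entity forces the string to become a character string.
# The three octets are kept as three separate characters
# (U+00E4, U+00B8, U+0080) instead of one Chinese character.
decode_entities($mixed);

printf "is_utf8: %s, length: %d\n",
    (utf8::is_utf8($mixed) ? 'yes' : 'no'), length $mixed;

# Encoding for output now turns each of those three characters into
# two octets, which is the double encoding seen in the browser.
my $out = encode('UTF-8', $mixed);
print uc(unpack('H*', $out)), "\n";
```

The string has four characters after decode_entities, not two, which is why the Chinese part comes out garbled while the Japanese character looks fine.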

The following works by first decoding your string from Big5 into a Perl character string (internal utf8), then replacing the HTML entities, and finally encoding everything into a raw byte string in UTF-8:

use Encode qw/encode decode/;
use HTML::Entities;

$str2 = "一&#12392;三";
$str2 = decode("big5",$str2);  # big5 bytes to internal utf8 -> utf8::is_utf8($str2) is true
decode_entities($str2);        # decode HTML entities
$str2 = encode('utf8',$str2);  # internal utf8 to raw bytes, utf8::is_utf8($str2) is false
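If you need this in more than one place, the three steps can be wrapped in a small helper. This is only a sketch: big5_html_to_utf8 is a hypothetical name, and the sample input is built from code points with encode('big5', ...) so the example does not depend on the encoding of the script file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: Big5 bytes containing HTML entities in,
# UTF-8 bytes out.
sub big5_html_to_utf8 {
    my ($big5_bytes) = @_;
    my $chars = decode('big5', $big5_bytes);  # bytes -> character string
    decode_entities($chars);                  # entities -> characters, in place
    return encode('UTF-8', $chars);           # character string -> UTF-8 bytes
}

# Sample input: U+4E00, the entity for U+3068, U+4E09, encoded as Big5.
my $input = encode('big5', "\x{4E00}&#12392;\x{4E09}");
my $utf8  = big5_html_to_utf8($input);

binmode STDOUT;  # the string already holds encoded bytes, so write it raw
print $utf8, "\n";
```

The point of the design is that decode_entities only ever sees a character string, so there are no stray octets for it to reinterpret.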