7
votes

I already know how to convert the content of a non-UTF-8 file to UTF-8 line by line, using something like the following code:

use Encode;

# outfile.txt is encoded in GB2312
open my $filter, "<", 'c:/outfile.txt' or die "Cannot open c:/outfile.txt: $!";

while (<$filter>) {
    # decode each line of outfile.txt from GB2312 bytes into Perl characters
    $_ = Encode::decode("gb2312", $_);
    ...
}

But I think Perl should be able to convert the whole input file to UTF-8 directly, so I've tried something like

# outfile.txt is encoded in GB2312
open my $filter, "<:utf8", 'c:/outfile.txt';

(Perl says something like: utf8 "\xD4" does not map to Unicode)

and

open my $filter,"<",'c:/outfile.txt'; 
$filter = Encode::decode("gb2312", $filter); 

(Perl says "readline() on unopened filehandle".)

Neither works. Is there some way to convert the input file to UTF-8 directly?

Update:

It looks like things are not as simple as I thought. I can now convert the input file to UTF-8 in a roundabout way: I open the input file, write its content to a new file as UTF-8, and then read the new file for further processing. This is the code:

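open my $filter, '<:encoding(gb2312)', 'c:/outfile.txt' or die "Cannot open c:/outfile.txt: $!";
open my $filter_new, '+>:utf8', 'c:/outfile_new.txt' or die "Cannot open c:/outfile_new.txt: $!";
print $filter_new $_ while <$filter>;
seek $filter_new, 0, 0;   # rewind before reading back what was just written
while (<$filter_new>){
...
}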

But this is too much work, and it is even more troublesome than simply decoding the content of $filter line by line.

2
Whenever you mention a warning message in a question, include the warning message in the question. :) - brian d foy
@brian, thanks for the suggestion. - Mike
It's best to use the exact warning message :) So, with that warning, you need to check the result of your open (which you should always do anyway). - brian d foy
Too much work? That looks pretty straightforward and doable with a couple of lines of code. Wrap that in a subroutine and you're done. I'm not sure why you opened one file with '+>' though. - brian d foy
Well, you need to seek to the beginning if you want to read it. - brian d foy

2 Answers

5
votes

I think I misunderstood your question. I think what you want to do is read a file in a non-UTF-8 encoding, then play with the data as UTF-8 in your program. That's something much easier. After you read the data with the right encoding, Perl represents it internally as UTF-8. So, just do what you have to do.

When you write it back out, use whatever encoding you want to save it as. However, you don't have to put it back in a file to use it.
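As a rough illustration of that read, process, write flow, here is a minimal sketch. It assumes the input is GB2312 and uses in-memory filehandles in place of real files so that it runs on its own; the sample bytes and the variable names ($gb_bytes, $utf8_bytes, $char_count) are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample: the characters U+4E2D U+6587 as GB2312 bytes, plus a newline.
my $gb_bytes = "\xD6\xD0\xCE\xC4\n";

# Read through an :encoding layer; an in-memory handle stands in for the file.
open my $in, '<:encoding(gb2312)', \$gb_bytes
    or die "Cannot open input: $!";
my @lines = <$in>;
close $in;
chomp @lines;

# Inside the program the data are characters, so length() counts characters.
my $char_count = length $lines[0];

# Write the data back out in whatever encoding is wanted, here UTF-8.
my $utf8_bytes = '';
open my $out, '>:encoding(UTF-8)', \$utf8_bytes
    or die "Cannot open output: $!";
print {$out} "$_\n" for @lines;
close $out;
```

With a real file, the scalar references would be replaced by file names; nothing has to be written to disk just to work with the decoded text.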


old answer

The Perl I/O layers only read the data assuming it's already properly encoded. It's not going to convert encoding for you. By telling open to use utf8, you're telling it that it already is utf8.

You have to use the Encode module just as you've shown (unless you want to write your own I/O layer). You can convert bytes to UTF-8, or if you know the encoding, you can convert from one encoding to another. Since it looks like you already know the encoding, you might want the from_to() function.
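For reference, a small sketch of from_to() might look like the following. It assumes the source bytes are GB2312; the sample bytes and variable names are only illustrative. Note that from_to() works on byte strings and converts them in place, so the result is UTF-8 bytes rather than a decoded character string.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical sample: the characters U+4E2D U+6587 as GB2312 bytes.
my $octets = "\xD6\xD0\xCE\xC4";

# Convert the bytes in place from GB2312 to UTF-8.
# On success from_to() returns the length of the converted byte string.
my $converted_length = from_to($octets, 'gb2312', 'UTF-8');
defined $converted_length
    or die "Conversion from gb2312 to UTF-8 failed";
```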

If you're just starting out with Perl and Unicode, go through Juerd's Perl Unicode Advice before you do anything.

4
votes

The :encoding layer will return UTF-8, suitable for perl's use. That is, perl will recognize each character as a character, even if they are multiple bytes. Depending on what you are going to do next with the data, this may be adequate.

But if you are doing something with the data where perl will try to downgrade it from utf8, you either need to tell perl not to (for instance, doing a binmode(STDOUT, ":utf8") to tell perl that output to stdout should be utf8), or you need to have perl treat your utf8 as binary data (interpreting each byte separately, and knowing nothing about the utf8 characters.)

To do that, all you need is to apply an additional layer to your open:

open my $foo, "<:encoding(gb2312):bytes", ...;

Note that the output of the following will be the same:

perl -we'open my $foo, "<:encoding(gb2312):bytes", "foo"; $bar = <$foo>; print $bar'
perl -CO -we'open my $foo, "<:encoding(gb2312)", "foo"; $bar = <$foo>; print $bar'

but in one case, perl knows that data read is utf8 (and so length($bar) will report the number of utf8 characters) and has to be explicitly told (by -CO) that STDOUT will accept utf8, and in the other, perl makes no assumptions about the data (and so length($bar) will report the number of bytes), and just prints it out as is.
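The character count versus byte count distinction can also be seen without any filehandles. The sketch below uses Encode's decode() and encode() directly rather than I/O layers, so it is an analogy to the two one-liners and not the same mechanism; the sample bytes and variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the characters U+4E2D U+6587 as GB2312 bytes.
my $gb_bytes = "\xD6\xD0\xCE\xC4";

# Decoded text: perl treats each character as one unit.
my $chars = decode('gb2312', $gb_bytes);

# The same text as raw UTF-8 bytes: perl knows nothing about the characters.
my $utf8_octets = encode('UTF-8', $chars);

my $n_chars = length $chars;         # counts characters
my $n_bytes = length $utf8_octets;   # counts bytes
```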