Perl Japanese to English filename replacement

Question

I put together a perl script that works to replace Japanese file names to English file names. But there are still a couple of things that I don’t quite understand well.

I have the following configuration Client OS:

Windows XP Japan

Notepad++, installed

Server:

Red Hat Enterprise Linux Server release 6.2

Perl v5.10.1

VIM : VIM version 7.2.411

Xterm : ASTEC-X version 6.0

CSH: tcsh 6.17.00 (Astron)

The source of the files are Japanese .csv files generated on Windows. I saw posts about using utf8 and encoding conversion in Perl, and I hope to understand better why I didn’t need anything mentioned in the other threads.

Here is my script that worked? My questions are below.

#!/usr/bin/perl
my $work_dir = "/nas1_home4/fsomeguy/someplace";
opendir(DIR, $work_dir) or die "Cannot open directory";
my @files = readdir(DIR);
foreach (@files) 
{
    my $original_file = $_; 
    s/機/–machine_/; # replace 機 with -machine_
    my $new_file = $_;
    if ($new_file ne $original_file)
    {
        print "Rename " . $original_file . " to " . $new_file;
        rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or  print "Warning: rename failed because: $!\n";
    }
}

Questions:

1) Why isn’t utf8 required in this sample? In what type of examples would I need it. Use uft8; was discussed: use utf8 gives me 'Wide character in print')? But if I have added use utf8, then this script won’t work.

2) Why isn’t encoding manipulation required in this sample?
I actually wrote the script in Windows using Notepad++ (pasting in the Japanese characters from Windows XP Japan’s Explorer to my script). In Xterm, and VIM, the characters show up as garbage characters. But I didn’t have to deal with Encoding manipulation either, which was discussed here How can I convert japanese characters to unicode in Perl? .

Thanks.

UPDATES 1

Testing a simple localization sample in Perl for filename and file text replacement in Japanese

In Windows XP, copy the 南 character from within a .csv data file and copy to the clipboard, then use it as both the file name (ie. 南.txt) and file content (南). In Notepad++ , reading the file under encoding UTF-8 shows x93xEC, reading it under SHIFT_JIS displays南.

Script:

Use the following Perl script south.pl, which will be run on a Linux server with Perl 5.10

#!/usr/bin/perl
use feature qw(say);

use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";

# forward declare the function prototypes
sub fileProcess;

opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};

# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note filename    could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");                    

# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)");                           # setting display to output utf8

say @files;                                 

# pass an array reference of files that will be modified
fileNameTranslate();
fileProcess();

closedir(DIR);

exit;

sub fileNameTranslate
{

    foreach (@files) 
    {
        my $original_file = $_; 
        #print "original_file: " . "$original_file" . "\n";     
        s/南/south/;     

        my $new_file = $_;
        # print "new_file: " . "$_" . "\n";

        if ($new_file ne $original_file)
        {
            print "Rename " . $original_file . " to \n\t" . $new_file . "\n";
            rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
        }
    }
}

sub fileProcess
{

    #   file process OPTION 3, open file as shift_jis, the search and replace would work
    #   open (IN1,  "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    #   open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";  

    #   file process OPTION 4, open file as utf8, the search and replace would not work
open (IN1,  "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";   

    while (<IN1>)
    {
        print $_ . "\n";
        chomp;

        s/南/south/g;


        print OUT1 "$_\n";
    }

    close IN1;
    close OUT1; 
}

Result:

(BAD) Uncomment Option 1 and 3, (Comment Option 2 and 4) Setup: Readdir encoding, SHIFT_JIS; file open encoding SHIFT_JIS Result: file name replacement failed.. Error: utf8 "\x93" does not map to Unicode at .//south.pl line 68. \x93

(BAD) Uncomment Option 2 and 4 (Comment Option 1 and 3) Setup: Readdir encoding, utf8; file open encoding utf8 Result: file name replacement worked, south.txt generated But south1.txt file content replacement failed , it has the content \x93 (). Error: "\x{fffd}" does not map to shiftjis at .//south.pl line 25. ... -Ao?= (Bx{fffd}.txt

(GOOD) Uncomment Option 2 and 3, (Comment Option 1 and 4) Setup: Readdir encoding, utf8; file open encoding SHIFT_JIS Result: file name replacement worked, south.txt generated South1.txt file content replacement worked, it has the content south.

Conclusion:

I had to use different encoding scheme for this example to work properly. Readdir utf8, and file processing SHIFT_JIS since the content of the csv file was SHIFT_JIS encoded.

I think the better question is, why wouldn't it work? If Notepad++ supports unicode, but your version of gvim and xterm don't, then why wouldn't it display the characters? Discussed here. — CornSmith
Built-in readdir/open etc. are broken w/r/t filesystem encodings. See section Unicode in Filenames in perltodo. To work around the problem, use Win32-Unicode or Path::Class::Unicode or PerlIO::fse or (low-level) Win32::FileOp/Encode::Locale. — daxim
CornSmith, I had no problem trying to get xterm to display the Japanese characters, but I also found that isn't a requirement to make this script work. My related question regarding LANG setting is posted here, stackoverflow.com/questions/17039705/…. — frank
daxim, I saw the Path::Class:Unicode example, along with other example I've been with unicode file processing, they usually uses a hex representation "\x{55ed}.txt"; What I really wanted to do, is be able to paste the strings from Windows explorer as is into my Perl script as is (so I don't have to look up the Unicode character code online for each replacement which can be tedious), and then run the Perl script on the Linux server side. It seems like my approach here would be the easiest for what I want to accomplish? — frank
@daxim, I have added UPDATES1 section to show a simplified example of what I am trying to do, and the different encoding combinations that I had to use. — frank

choroba choroba · Accepted Answer · 2013-06-11T09:06:40

Your script is totally unicode unaware. It treats all the strings as sequences of bytes. Fortunately, the bytes encoding the file names are identical to bytes encoding the Japanese characters used in the source. If you tell Perl to use utf8, it would interpret the Japanese characters in your script, but not the ones coming from the file system, so there will be no match.

Perl Japanese to English filename replacement

1 Answers