
I am reading in a file (Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators), then doing some processing on that file, then using some data from the input file to output a new file. I have tried many things from various questions and blog posts, and I will admit to being totally confused at this point. While writing this question, I was stuck on a BOM error, but on a suggestion from another question I changed my "open" statement to include :encoding(UTF-16le); now my error is "Wide character in subroutine entry", which I am also unable to resolve.

OS: Windows 10
Shell: cmd
Perl: This is perl 5, version 14, subversion 2 (v5.14.2) built for MSWin32-x86-multi-thread

I have tried with and without the layers (:encoding(UTF-16le):crlf) on both the input and the output. I have tried with and without the explicit encode/decode calls. The results included BOM errors, the wide-character error that I'm currently stuck on, and an exported file that (when opened in LibreOffice) shows what look like some kind of Asian characters when imported as UTF-16 but looks more normal, though still incorrect, when imported as UTF-8. The best I have managed outputs a file that is mostly correct but includes a nonsense character in place of a correct accented character (c with a cedilla, ç). Unfortunately, due to poor experimentation protocol, I no longer have that file nor the steps to reproduce it.

use strict;
use warnings;
use Encode qw(encode decode);

use POSIX 'strftime'; # because I like timestamps for lots of things

my $path = '.';                    # placeholder: input directory
my $inputFile = 'input.csv';       # placeholder: input file name
my $outputFileName = 'output.csv'; # placeholder: output file name
my $output = '';                   # accumulates the lines to write out

# removed :crlf per instructions
open(my $input_fh, '<:encoding(UTF-16le)', "$path/$inputFile")
 or die "Could not open file '$path/$inputFile' $!";

while (my $line = <$input_fh>) {
  #$line = decode ('UTF-16le', $line); # removed per instructions
  chomp $line;
  my @lineArray;
  my $last_char = "";
  my $current_char = "";
  my $current_string = "";
  my $field_count = 0;
  my $inside_quote = 0;

  for my $i (0..length($line)-1) {
    $last_char = $current_char;
    $current_char = substr($line, $i, 1);

    # Catch first char in the string?
    if ($current_char eq "," && $inside_quote == 0) { # if you find a comma and we're not inside quotes, it's a new field
      # put the whole string into the array as one field
      $lineArray[$field_count] = $current_string;
      $current_string = "";
      $field_count++;
    }
    elsif ($current_char eq '"' && $inside_quote == 0) { # found the first of two quotes
      $inside_quote = 1;
      # no need to update $current_string
      # no need to update $field_count
    }
    elsif ($current_char eq '"' && $inside_quote == 1) { # found a second quote, need to decide if it's in-field or an end quote
      $inside_quote++;
      $current_string .= '"';
      # no need to update $field_count
    }
    elsif ($current_char eq "," && $inside_quote >= 2) { # we are at the end of a string, but there was more than 1 quote
      # removes the trailing quote, if there was one
      if ($last_char eq '"') { chop $current_string; } # chop returns the removed character, not the string, so don't assign its result
      $lineArray[$field_count] = $current_string;
      $current_string = "";
      $field_count++;
      $inside_quote = 0;
    }
    else {
      $current_string .= $current_char;
    }
  } # for my $i (0..length($line)-1)
  my $id = $lineArray[0];
  my $name = $lineArray[1];
  my $campus = $lineArray[2];
  my $building = $lineArray[3];

  $output .= '"'.$id.'","'.$name.'","'.$campus.'","'.$building.'"'."\r\n";
}
# removed :crlf per instructions
open(my $output_fh, '>:encoding(UTF-16le)', $outputFileName)
 or die "Could not open file '$outputFileName' $!";

#$output = encode ('UTF-16le', $output); #removed per instructions

print $output_fh $output;  

Error: Wide character in subroutine entry at C:/Dwimperl/perl/lib/Encode.pm line 176, line 1.

I am hoping for a file that remains the same as the input (Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators) while maintaining "correct" special characters like the cedilla on the c. I am hitting a wall, and any help would be greatly appreciated.
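For reference, this error can be reproduced in isolation. decode expects a string of raw octets; handing it text that has already been decoded (which is exactly what an :encoding layer on the filehandle produces) triggers the same croak once the string contains a "wide" character above 0xFF. A minimal sketch, with made-up sample text:

```perl
use strict;
use warnings;
use Encode qw(decode);

# This string is already text, not octets: U+263A is above 0xFF,
# so Perl stores it internally as a "wide" character.
my $already_text = "caf\x{e9} \x{263a}";

# decode() wants raw bytes; feeding it wide characters croaks with
# the same "Wide character in subroutine entry" message.
my $err = '';
eval { decode('UTF-16le', $already_text); 1 } or $err = $@;
print $err ? "decode croaked: $err" : "decode succeeded\n";
```

This is why removing the explicit decode/encode calls (or, alternatively, removing the :encoding layers) resolves that particular error: the data should be decoded exactly once.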

Update (2019-01-14): Updated the code to include the "processing" and the changes suggested by the commenters. My purpose is to process a CSV file and output a few different files. I tried to use the CSV-processing libraries, but couldn't get them to work because the input CSV is not well-formed (and I can't control it). Therefore, I'm making the classic mistake of writing my own parser. What you see above is the beginning of that parser. There are many other fields and many other actions to be taken on those fields (which is why I have stored them in nicely named variables rather than leaving them in hard-to-remember array slots). My thanks to everyone who has responded thus far. You are most definitely helping me past the wall.
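If the hand-rolled character loop above grows unwieldy, the same quote handling can be condensed into a single regex pass. The sketch below (split_csv is a made-up helper, not from any module) unwraps quoted fields, unescapes doubled quotes, and leaves genuinely malformed fields verbatim rather than dying; truly broken lines will still need special-casing:

```perl
use strict;
use warnings;

# Split one CSV line into fields.  A quoted field may contain commas
# and doubled quotes (""), which are unescaped to a single quote.
sub split_csv {
    my ($line) = @_;
    my @fields;
    while ($line =~ /\G(?:"((?:[^"]|"")*)"|([^,]*))(,|$)/g) {
        my $field = defined $1 ? $1 : $2;
        $field =~ s/""/"/g if defined $1;  # unescape "" inside quotes
        push @fields, $field;
        last if $3 eq '';  # no trailing comma: that was the last field
    }
    return @fields;
}

my ($id, $name, $campus, $building) =
    split_csv(q{"1","O'Neil, Pat","Main","B ""annex"""});
```

For well-formed input, Text::CSV (possibly with allow_loose_quotes enabled) remains the safer choice; a hand regex like this is only a fallback for input the module rejects.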

Update 2 (2019-01-14): After uploading my code, I tried again, and I have more debugging info. First, my "testing" consists of opening the output file in LibreOffice Calc. As I noted, the UTF-16 import showed the Asian-looking characters, and the UTF-8 import looked more normal but was still wrong (in this case, some garbled characters and everything on one long line). HOWEVER, when I open the file in a text editor (like Atom), the file looks fine (except that every character has a space after it, which I understand is to be expected with UTF-16).

SOLUTION (2019-01-14): The last comment by @ikegami was the solution. Leaving my code alone and adding :raw to the input open and the output open created a UTF-16 file that LibreOffice Calc can import correctly. Interestingly, running the "file" utility on the output file results in "test.csv: data", which is not super encouraging. If anyone wants to try to answer why it isn't the same as the input file, I would love to know, but in any case I will consider this question answered. Thanks to all who helped! I will try to figure out how to upvote you or whatever. Greatly appreciated! Also, any comments that tell me how to properly close this and/or properly reward those who helped are welcome.
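For anyone hitting the same wall, the fix amounts to the open calls sketched below (file names here are made up, and the sketch writes its own test input so it can round-trip it). On Windows the default :crlf layer sits on the raw byte stream and mangles UTF-16, so :raw strips it and :crlf is re-added above the encoding layer, where it translates line endings on decoded text:

```perl
use strict;
use warnings;

my ($in, $out) = ('rt_in.csv', 'rt_out.csv');  # made-up names

# Create a small UTF-16LE, CRLF-terminated input file to round-trip.
{
    open my $fh, '>:raw:encoding(UTF-16le):crlf', $in
        or die "Could not open '$in': $!";
    print {$fh} qq{"1","Fran\x{e7}ois","Main","A"\n};  # note the ç
    close $fh;
}

# :raw strips any default layers (notably :crlf on Windows), the
# encoding layer handles UTF-16LE, and :crlf on top translates line
# endings on the decoded text where it belongs.
open my $input_fh, '<:raw:encoding(UTF-16le):crlf', $in
    or die "Could not open '$in': $!";
open my $output_fh, '>:raw:encoding(UTF-16le):crlf', $out
    or die "Could not open '$out': $!";

print {$output_fh} $_ while <$input_fh>;

close $input_fh;
close $output_fh;
```

As for file reporting "test.csv: data": the :encoding(UTF-16le) layer writes no byte-order mark, and file relies largely on the BOM to recognize UTF-16, so an otherwise valid BOM-less UTF-16LE file is often reported as plain data.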

You're getting that error because you're passing already-decoded text to decode. Remove the explicit encode and decode if you're going to use :encoding layers. - ikegami
Thanks! Removing those lines results in a file (when using UTF-16 import in Libre Office) with Asian-looking characters: 䜀攀漀戀 - RocketBouchard
what @ikegami said. plus, historically there have been issues using :crlf with ucs2/utf-16 encodings; maybe remove that and just make sure you end any added lines with "\r\n" (before encoding) - ysth
Thanks for responding! removing the crlf (from both input open and output open) and adding \r before my existing \n results in the same problem I reported to @ikegami - RocketBouchard
@RocketBouchard seems like the first step is just getting your data to roundtrip correctly, so your "do stuff" should just be $output .= $line; and verifying the output file is identical to the input file - ysth

1 Answer


The following script works on my system (Ubuntu 18.04) at least.

use strict;
use warnings;
use Encode qw(encode decode);
use utf8; # the script source itself is encoded with UTF-8

my $path = '.';                    # placeholder: input directory
my $inputFile = 'input.csv';       # placeholder: input file name
my $outputFileName = 'output.csv'; # placeholder: output file name
my $output = '';

open(my $input_fh, '<:encoding(UTF-16le):crlf', "$path/$inputFile")
 or die "Could not open file '$path/$inputFile' $!";

while (my $line = <$input_fh>) {
  # some operations on the input text
  $line =~ s/フォルダー?/folder/g;
  $line =~ s/Windows/ウィンドウズ/g;
  $output .= $line;
}

open(my $output_fh, '>:encoding(UTF-16le):crlf', $outputFileName)
 or die "Could not open file '$outputFileName' $!";

print $output_fh $output;
  • I haven't tested the script on Windows 10, but the input text was created on Windows and encoded with UTF-16LE.
  • The script itself is encoded with UTF-8.

If you still have a problem, providing a minimal set of input text and the processing steps needed to reproduce it would be helpful.