0
votes

I am being sent a csv file that is tab delimited. Here is a sample of what I see:

Invoice: Invoice Date   Account: Name   Bill To: First Name Bill To: Last Name  Bill To: Work Email Rate Plan Charge: Name  Subscription: Device Serial Number
2021-03-10  Test Company    Wally   Kolcz   [email protected]   Sample plan A0H1234567890A

I wrote a script to open, read and loop over the values but I get weird stuff after:

if (($handle = fopen($user_file, "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
        if ($line > 1 && isset($data[1])) {
            $user = [
                'EmailAddress' => $data[4],
                'Name' => $data[2].' '.$data[3],
            ];
        }

        $line++;
    }
    fclose($handle);
}

Here is what I get when I dump the first line.

array:7 [▼
  0 => b"ÿþI\x00n\x00v\x00o\x00i\x00c\x00e\x00:\x00 \x00I\x00n\x00v\x00o\x00i\x00c\x00e\x00 \x00D\x00a\x00t\x00e\x00"
  1 => "\x00A\x00c\x00c\x00o\x00u\x00n\x00t\x00:\x00 \x00N\x00a\x00m\x00e\x00"
  2 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00F\x00i\x00r\x00s\x00t\x00 \x00N\x00a\x00m\x00e\x00"
  3 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00L\x00a\x00s\x00t\x00 \x00N\x00a\x00m\x00e\x00"
  4 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00W\x00o\x00r\x00k\x00 \x00E\x00m\x00a\x00i\x00l\x00"
  5 => "\x00R\x00a\x00t\x00e\x00 \x00P\x00l\x00a\x00n\x00 \x00C\x00h\x00a\x00r\x00g\x00e\x00:\x00 \x00N\x00a\x00m\x00e\x00"
  6 => "\x00S\x00u\x00b\x00s\x00c\x00r\x00i\x00p\x00t\x00i\x00o\x00n\x00:\x00 \x00D\x00e\x00v\x00i\x00c\x00e\x00 \x00S\x00e\x00r\x00i\x00a\x00l\x00 \x00N\x00u\x00m\x00b\x00e\x00r\x00 ◀"
]

I tried adding:

header('Content-Type: text/html; charset=UTF-8');
$data = array_map("utf8_encode", $data);
setlocale(LC_ALL, 'en_US.UTF-8');

And when I dump mb_detect_encoding($data[2]), I get 'ASCII'...

Any way to fix this so I don't have to manually update the file each time I receive it? Thanks!

3
The problem is not simply "file has a BOM", please do not close it as such. - Sammitch

3 Answers

4
votes

Looks like the file is in UTF-16 (every other byte is null).

You probably need to convert the whole file with something like mb_convert_encoding($data, "UTF-8", "UTF-16");

But you can't really use fgetcsv() in that case…
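A minimal sketch of that whole-file approach, assuming the input really is UTF-16LE and tab delimited as in the question. The helper name parseUtf16leTsv() and the in-memory sample are made up for illustration; str_getcsv() stands in for fgetcsv() because the data is a string by this point:

```php
<?php
// Hypothetical helper (not from the original post): convert a whole
// UTF-16LE, tab-delimited file body to UTF-8 and parse it into rows.
function parseUtf16leTsv(string $raw): array
{
    // Drop the UTF-16LE byte order mark (FF FE) if present.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }

    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

    $rows = [];
    foreach (preg_split('/\R/', $utf8) as $line) {
        if ($line === '') {
            continue; // skip blank lines, e.g. after the final newline
        }
        // Tab delimiter; enclosure and escape are passed explicitly.
        $rows[] = str_getcsv($line, "\t", '"', '\\');
    }
    return $rows;
}

// Build a small UTF-16LE sample in memory instead of reading a real file.
$sample = "\xFF\xFE" . mb_convert_encoding(
    "First Name\tLast Name\nWally\tKolcz\n",
    'UTF-16LE',
    'UTF-8'
);

$rows = parseUtf16leTsv($sample);
```

In practice $raw would come from file_get_contents(). Splitting on line breaks first means a quoted field containing a newline would be broken apart, so this only suits files without such fields.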

4
votes

As @Andrea already mentioned, your data is encoded as UTF-16LE and you need to convert it to an encoding compatible with what you want to do. That said, it is possible to do this in-flight with PHP stream filters.

abstract class TranslateCharset extends php_user_filter {

    protected $in_charset, $out_charset;
    private $buffer = '';
    private $total_consumed = 0;

    public function filter($in, $out, &$consumed, $closing) {
        $output = '';

        while ($bucket = stream_bucket_make_writeable($in)) {
            $input = $this->buffer . $bucket->data;
            for( $i=0, $p=0; ($c=mb_substr($input, $i, 1, $this->in_charset)) !== ""; ++$i, $p+=strlen($c) ) {
                $output .= mb_convert_encoding($c, $this->out_charset, $this->in_charset);
            }
            $this->buffer = substr($input, $p);
            $consumed += $p;
        }

        // this means that there's unconverted data at the end of the brigade.
        if( $closing && strlen($this->buffer) > 0 ) {
            // report the offset without updating the running total here;
            // the total is updated once, after this block.
            $this->raise_error( sprintf(
                "Likely encoding error at offset %d in input stream, subsequent data may be malformed or missing.",
                $this->total_consumed + $consumed)
            );
            $consumed += strlen($this->buffer);
            // give it the ol' college try
            $output .= mb_convert_encoding($this->buffer, $this->out_charset, $this->in_charset);
        }

        $this->total_consumed += $consumed;

        if ( ! isset($bucket) ) {
            $bucket = stream_bucket_new($this->stream, $output);
        } else {
            $bucket->data = $output;
        }
        stream_bucket_append($out, $bucket);
        return PSFS_PASS_ON;
    }

    protected function raise_error($message) {
        user_error( sprintf(
            "%s[%s]: %s",
            __CLASS__, get_class($this), $message
        ), E_USER_WARNING);
    }

}

class UTF16LEtoUTF8 extends TranslateCharset {
    protected $in_charset = 'UTF-16LE';
    protected $out_charset = 'UTF-8';
}

stream_filter_register('UTF16LEtoUTF8', 'UTF16LEtoUTF8');

// properly-encoded UTF-16LE example input "Invoice:,a" (the UTF-16LE BOM is FF FE)
$in = "\xFF\xFEI\x00n\x00v\x00o\x00i\x00c\x00e\x00:\x00,\x00a\x00";

// prep example pipe, in practice this would simply be your fopen() call.
$fh = fopen('php://memory', 'rwb+');
fwrite($fh, $in);
rewind($fh);

// skip BOM
fseek($fh, 2);
stream_filter_append($fh, 'UTF16LEtoUTF8', STREAM_FILTER_READ);

var_dump(fgetcsv($fh, 4096));

Output:

array(2) {
  [0]=>
  string(8) "Invoice:"
  [1]=>
  string(1) "a"
}
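For comparison, a shorter sketch of the same in-flight idea using PHP's built-in convert.iconv.* stream filter instead of a user filter. This assumes the iconv extension is loaded and that its implementation knows the name UTF-16LE; the tab-delimited sample row is made up to mirror the question:

```php
<?php
// Assumption: the iconv extension is available, which is what provides
// the convert.iconv.* family of stream filters.
$in = "\xFF\xFE" . mb_convert_encoding("Wally\tKolcz\n", 'UTF-16LE', 'UTF-8');

// In-memory stand-in for the real file handle.
$fh = fopen('php://memory', 'w+b');
fwrite($fh, $in);
rewind($fh);

// Skip the 2-byte BOM, then convert everything read from here on.
fseek($fh, 2);
stream_filter_append($fh, 'convert.iconv.UTF-16LE/UTF-8', STREAM_FILTER_READ);

// Tab delimiter, as in the question; enclosure and escape passed explicitly.
$row = fgetcsv($fh, 4096, "\t", '"', '\\');
fclose($fh);
```

The user filter above keeps control over how incomplete trailing bytes are reported, which the built-in filter does not offer.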

In practice there is no "magic bullet" to detect the encoding of an input file or string. In this case there is a Byte Order Mark [BOM] of 0xFF 0xFE that denotes that this is UTF-16LE, but a BOM is not a reliable signal in general: it is frequently omitted, the same bytes may occur naturally at the beginning of any arbitrary string, most encodings do not require one, and whoever encoded the data may simply not have used one.
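If you still want to use a BOM as a hint when one is present, a check could look roughly like this. detectBom() is a hypothetical helper, and for the reasons above its result is only a guess, not proof of the encoding:

```php
<?php
// Hypothetical helper: guess an encoding from a leading byte order mark.
// Returns [encoding, bomLength], or [null, 0] when no known BOM is found.
function detectBom(string $bytes): array
{
    // Longer BOMs are checked first, because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];

    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return [$encoding, strlen($bom)];
        }
    }

    return [null, 0];
}

// First bytes of the header row from the question: BOM, then "In".
$guess = detectBom("\xFF\xFEI\x00n\x00");

// Plain ASCII has no BOM, so nothing can be inferred from it.
$none = detectBom("Invoice");
```

A caller would treat a null encoding as "unknown" and fall back to whatever the sender has documented.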

That last bit is the exact reason why everyone should avoid the utf8_encode() and utf8_decode() functions like the plague: they simply assume that you only ever want to go between UTF-8 and ISO-8859-1 [Western European], and they make no effort to avoid corrupting your data when used incorrectly, because they can't possibly know any better.
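To make that concrete, here is a small sketch of what goes wrong when UTF-16LE bytes are treated as ISO-8859-1. It uses mb_convert_encoding() with an explicit 'ISO-8859-1' source to stand in for what utf8_encode() does, and the two-letter sample string is made up:

```php
<?php
// "Hi" as UTF-16LE with a BOM, like the start of the file in the question.
$utf16 = "\xFF\xFEH\x00i\x00";

// What utf8_encode() effectively does: treat every byte as ISO-8859-1.
// The BOM bytes become the characters "ÿþ" and every null byte survives.
$wrong = mb_convert_encoding($utf16, 'UTF-8', 'ISO-8859-1');

// Declaring the real source encoding (after dropping the 2-byte BOM).
$right = mb_convert_encoding(substr($utf16, 2), 'UTF-8', 'UTF-16LE');

var_dump(bin2hex($wrong)); // string(16) "c3bfc3be48006900"
var_dump($right);          // string(2) "Hi"
```

The null bytes in the wrong result match the dump in the question: the text was re-encoded but never actually decoded from UTF-16LE.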

TLDR: You must explicitly know the encoding of your input data, or you're going to have a bad time.

Edit: Since I've gone and put a proper spitshine on this I've put it up as a Composer package, in case anyone else needs something like this.

https://packagist.org/packages/wrossmann/costrenc

1
votes

I ended up with this as working code:

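$f = file_get_contents($user_file);
$f = mb_convert_encoding($f, 'UTF-8', 'UTF-16LE');
$f = preg_split("/\R/", $f);
$f = array_map('str_getcsv', $f);
$line = 0;

foreach ($f as $record) {
    // Skip the header row (line 0) and blank lines.
    if ($line !== 0 && isset($record[0])) {
        // str_getcsv() above splits on commas, so a row without commas
        // is still whole in $record[0]; split it on tabs here.
        $pieces = preg_split('/[\t]/', $record[0]);

        //My work here
    }
    $line++;
}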

Thank you everyone for your examples and suggestions!