
How to deal with invalid UTF-8 sequences in data from an external file or external command, when that data is used to generate HTML (in a Perl web app)?

Currently I am running to_utf8() on each piece of data; this subroutine detects whether the data is invalid UTF-8, and falls back to 'latin1' encoding if it is:

use utf8;
use Encode;
binmode STDOUT, ':utf8';

my $fallback_encoding = 'latin1';

sub to_utf8 {
    my $str = shift;
    return undef unless defined $str;
    # utf8::decode() converts in place and returns true only
    # if $str contained well-formed UTF-8
    if (utf8::decode($str)) {
        return $str;
    } else {
        return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
    }
}

Please correct me if this code is incorrect.

The relevant fragment of the recommended setup in Perl Unicode Essentials, from Tom Christiansen’s materials for OSCON 2011, is:

use utf8;
use open qw( :encoding(UTF-8) :std );

How can I get behavior similar to what I have now using a setup like the one above? I'd prefer automatic handling of Unicode, rather than having to remember to run to_utf8() on every string that comes from external commands and files.

The data comes from external files or from the output of external commands; it should be UTF-8, but because of user errors it sometimes isn't.

Maybe this answer provides some insight: stackoverflow.com/questions/6234386/… – matthias krull

1 Answer


You can write a custom IO layer that does the "magical" decoding.

Usually IO layers (like :utf8) are written in XS, but the core module PerlIO::via (see http://search.cpan.org/perldoc?PerlIO::via) lets you implement a layer in Perl code.
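
As a minimal sketch of what such a layer could look like (the package name PerlIO::via::UTF8Fallback is made up for this illustration, and the latin1 fallback mirrors the one in your question): the FILL method reads a line of raw bytes from the layer below, tries a strict UTF-8 decode, falls back to latin1 on malformed input, and passes well-formed UTF-8 bytes upward.

# Hypothetical layer; save as PerlIO/via/UTF8Fallback.pm
package PerlIO::via::UTF8Fallback;
use strict;
use warnings;
use Encode qw(decode encode);

my $fallback_encoding = 'latin1';   # assumption: same fallback as in the question

# Class method called when the layer is pushed onto a handle.
sub PUSHED {
    my ($class, $mode, $fh) = @_;
    return bless {}, $class;        # no per-handle state needed
}

# Called whenever perl needs more input: read raw bytes from the
# layer below, validate them, and return well-formed UTF-8 bytes.
sub FILL {
    my ($self, $fh) = @_;
    my $line = <$fh>;
    return undef unless defined $line;   # EOF

    # Strict decode croaks on malformed UTF-8, hence the eval.
    my $str = eval {
        decode('UTF-8', $line, Encode::FB_CROAK | Encode::LEAVE_SRC)
    };
    # Malformed input: reinterpret the raw bytes as latin1 instead.
    $str = decode($fallback_encoding, $line) unless defined $str;

    # Hand valid UTF-8 bytes to the layers above.
    return encode('UTF-8', $str);
}

1;

You would then stack :encoding(UTF-8) above it, so the now-guaranteed-valid bytes are decoded into Perl characters automatically:

use PerlIO::via::UTF8Fallback;

open my $fh, '<:via(UTF8Fallback):encoding(UTF-8)', $file
    or die "Cannot open '$file': $!";

Reading line by line in FILL is safe here because UTF-8 continuation bytes can never equal "\n", so readline never splits a multibyte character across two reads.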