How should I deal with invalid UTF-8 sequences in data that comes from an external file or an external command, when that data is used to generate HTML (in a Perl web app)?
Currently I run to_utf8() on each piece of data; the subroutine detects whether the data is invalid UTF-8 and falls back to 'latin1' encoding:
use utf8;
use Encode qw(decode);

binmode STDOUT, ':utf8';

my $fallback_encoding = 'latin1';

sub to_utf8 {
    my $str = shift;
    return undef unless defined $str;
    if (utf8::valid($str)) {
        utf8::decode($str);
        return $str;
    } else {
        return decode($fallback_encoding, $str, Encode::FB_DEFAULT);
    }
}
Please correct me if this code is incorrect.
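For comparison, here is a variant I have been experimenting with that attempts a strict decode first and only falls back to latin1 when the bytes really are malformed (just a sketch; to_utf8_strict is my working name for it):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sketch: try a strict UTF-8 decode; FB_CROAK makes decode() die on
# malformed input, and LEAVE_SRC keeps $str itself unmodified.
# On failure, fall back to latin1, which maps every possible byte
# to a character and therefore cannot fail.
sub to_utf8_strict {
    my $str = shift;
    return undef unless defined $str;
    my $decoded = eval { decode('UTF-8', $str, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    return defined $decoded ? $decoded : decode('latin1', $str);
}
```

This avoids utf8::valid(), which (as I understand it) checks Perl's internal consistency rather than whether the bytes form well-formed UTF-8.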
A fragment of the recommended setup from Perl Unicode Essentials in Tom Christiansen’s materials for OSCON 2011 is:
use utf8;
use open qw( :encoding(UTF-8) :std );
How can I get behavior similar to my current setup using something like the above? I'd prefer automatic handling of Unicode over having to remember to pass every output string from external commands and files through to_utf8().
The data comes from external files or from the output of external commands. It should be in UTF-8, but because of user errors it sometimes isn't.
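To illustrate what I mean by automatic handling, the pragma adds the :encoding(UTF-8) layer to handles opened in its lexical scope, so decoding happens without any explicit call (a sketch using an in-memory handle so it is self-contained):

```perl
use strict;
use warnings;
use open qw( :encoding(UTF-8) :std );

# With the pragma in effect, handles opened in this lexical scope get
# the :encoding(UTF-8) layer automatically -- including in-memory ones.
my $bytes = "caf\xC3\xA9\n";    # raw UTF-8 bytes for "café" plus newline
open my $in, '<', \$bytes or die "open: $!";
my $line = <$in>;
close $in;
# $line is now a decoded character string: 5 characters, not 6 bytes.
```

As far as I can tell, though, when this layer hits a malformed sequence it emits a warning and substitutes a \xHH escape (the default $PerlIO::encoding::fallback behaviour), rather than re-trying the data as latin1, which is the fallback I actually want.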