My Perl program takes some text from a disk file as input, wraps it in some XML, then writes it to STDOUT. The input is nominally UTF-8 but sometimes has junk inserted. I need to sanitize the output so that no invalid UTF-8 octets are emitted; otherwise the downstream consumer (Sphinx) will blow up.
At the very least I would like to know if the data is invalid so I can avoid passing it on; ideally I could remove just the offending bytes. However, enabling all the fatalisms I can find doesn't quite get me there with Perl 5.12 (FWIW, `use v5.12; use warnings qw( FATAL utf8 );` is in effect).
I'm specifically having trouble with the sequence `"\xEF\xBF\xBE"`. If I create a file containing only those three bytes (`perl -e 'print "\xEF\xBF\xBE"' > bad.txt`) and try to read it back with mode `:encoding(UTF-8)`, the read errors out with `utf8 "\xFFFE" does not map to Unicode`, but only under 5.14.0; 5.12.3 and earlier are perfectly happy reading, and later writing, that sequence. I'm unsure where it's getting the `\xFFFE` (an illegal reverse BOM) from, but at least having a complaint is consistent with Sphinx.
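For reference, the failing read is nothing exotic; a minimal reproduction (assuming the `bad.txt` written above) looks like this:

```perl
use v5.12;
use warnings qw( FATAL utf8 );

# Read the three bytes back through the strict encoding layer.
# Under 5.14.0 the readline dies with the "\xFFFE" complaint above;
# under 5.12.3 it hands back the character without protest.
open my $fh, '<:encoding(UTF-8)', 'bad.txt' or die "open bad.txt: $!";
my $line = <$fh>;
close $fh;
```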
Unfortunately, `decode_utf8("\xEF\xBF\xBE", 1)` raises no error under either 5.12 or 5.14. I'd also prefer a detection method that doesn't require an encoding I/O layer, since that just leaves me with an error message and no way to sanitize the raw octets.
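The closest I've come to layer-free detection is a check along these lines. `is_strict_utf8()` is just a name I made up, and the character classes are my guess at what a strict consumer rejects (noncharacters like U+FFFE, surrogates, anything beyond U+10FFFF) on top of structurally malformed bytes; I don't know whether any single Encode call catches all of them on its own:

```perl
use Encode ();

sub is_strict_utf8 {
    my ($octets) = @_;
    my $text = eval {
        # Lax 'utf8' decode with FB_CROAK still dies on structurally
        # malformed byte sequences; LEAVE_SRC keeps the argument intact.
        Encode::decode('utf8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return 0 unless defined $text;                                  # malformed bytes
    return 0 if $text =~ /[\p{Noncharacter_Code_Point}\p{Cs}]/;     # U+FFFE and friends
    return 0 if $text =~ /[^\x00-\x{10FFFF}]/;                      # beyond Unicode
    return 1;
}
```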
I'm sure there are more sequences that I need to address, but just handling this one would be a start. So my questions are: can I reliably detect this kind of problem data with a perl before 5.14? What substitution routine can generally sanitize almost-UTF-8 into strict UTF-8?
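For concreteness, the shape of routine I'm imagining is sketched below: decode leniently so malformed bytes become U+FFFD, scrub the code points strict consumers reject, then re-encode strictly. `sanitize_utf8()` and the substitution classes are my own guesses rather than a known-good recipe:

```perl
use Encode qw( decode encode );

sub sanitize_utf8 {
    my ($octets) = @_;

    # Lax decode: structurally malformed sequences are replaced with
    # U+FFFD (the default FB_DEFAULT behaviour) instead of dying.
    my $text = decode('utf8', $octets);

    # Replace code points that strict UTF-8 consumers reject:
    # noncharacters (U+FFFE, U+FFFF, U+FDD0..U+FDEF, and friends),
    # surrogates, and anything beyond U+10FFFF.
    $text =~ s/[\p{Noncharacter_Code_Point}\p{Cs}]/\x{FFFD}/g;
    $text =~ s/[^\x00-\x{10FFFF}]/\x{FFFD}/g;

    # Strict re-encode; everything left should round-trip cleanly.
    return encode('UTF-8', $text);
}
```

Whether substituting U+FFFD (as opposed to deleting the offending bytes outright) is acceptable to Sphinx is part of what I'm hoping an answer can clear up.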