Reading a UTF-8 encoded file after a seek, as in
open(FILE, '<:utf8', $file) or die; seek(FILE, $readFrom, 0); read(FILE, $_, $size);
sometimes "breaks up" a Unicode character, because seek uses byte offsets even on a :utf8 handle, so the beginning of the string that was read is not valid UTF-8.
If you then do, e.g.,
s{^([^\n]*\r?\n)}{}i
to strip the incomplete first line, you get "Malformed UTF-8 character (fatal)" errors.
How can I fix this?
One solution, listed in "How do I sanitize invalid UTF-8 in Perl?", is to remove all invalid UTF-8 characters:
tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;
However, scanning the entire string seems like overkill, since only the first byte(s) of the string that was read can be broken.
Can anyone suggest a way to strip only an initial invalid character (or make the above substitution not die on malformed UTF-8)?
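One possible direction, sketched here under the assumption that reading the file in raw (byte) mode is acceptable: read bytes, drop any leading UTF-8 continuation bytes (0x80-0xBF, at most three of them), and only then decode. The helper name read_utf8_chunk is hypothetical, and it assumes the file is otherwise valid UTF-8. Note that $size counts bytes here, not characters, so the end of the chunk can also fall inside a character. The sketch relies on Encode::FB_QUIET to stop decoding before such an incomplete tail.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original post): read up to $size BYTES
# starting at byte offset $offset and return only the complete UTF-8
# characters found in that window, as a decoded Perl string.
sub read_utf8_chunk {
    my ($path, $offset, $size) = @_;

    # Raw mode: no decoding happens on read, so nothing can die mid-character.
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    seek($fh, $offset, 0)        or die "Cannot seek in $path: $!";
    my $n = read($fh, my $bytes, $size);
    defined $n or die "Cannot read $path: $!";
    close($fh);

    # Continuation bytes have the form 10xxxxxx (0x80-0xBF). A character has
    # at most three of them, so only the first few bytes are inspected.
    $bytes =~ s/\A[\x80-\xBF]{1,3}//;

    # FB_QUIET makes decode() return what it has decoded so far when it hits
    # a malformed or incomplete sequence, e.g. a character cut off at the end.
    return decode('UTF-8', $bytes, Encode::FB_QUIET);
}
```

With the decoded string in $_, the substitution s{^([^\n]*\r?\n)}{} from above can then be applied to strip the incomplete first line.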
tr to only the first character? – Sobrique
perldoc -f read: Note the characters: ...By default all filehandles operate on bytes, but...if the filehandle has been opened with the ":utf8" I/O layer the I/O will operate on UTF-8 encoded Unicode characters, not bytes. Please give a minimal example of this happening. – Vorsprung
tr strips out at least 29 valid characters!!!! – ikegami
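The figure of 29 can be checked with a small sketch. The assumption here is that the count refers to the C0 control characters U+0000 to U+001F other than tab, LF and CR, which the tr keep-list from the question does not include. The variable name @removed is only illustrative.

```perl
use strict;
use warnings;

# Collect every code point in U+0000..U+001F that the tr from the question
# deletes. Each one is tested as a one-character string.
my @removed = grep {
    my $copy = chr($_);
    $copy =~ tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;
    $copy eq '';    # empty means the character was stripped
} 0x00 .. 0x1F;

print scalar(@removed), "\n";    # prints 29
```

These are all valid code points that encode to well-formed UTF-8, so the tr is stricter than a pure UTF-8 validity check.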