Before anyone tells me to RTFM, I must say that I have dug through:
- Why does modern Perl avoid UTF-8 by default?
- Checklist for going the Unicode way with Perl
- How to match string with diacritic in perl?
- How to make "use My::defaults" with modern perl & utf8 defaults?
- and many others (perluniintro and so on), but I have surely missed something
So, the basic code:
use 5.014;      # enables the 'unicode_strings' feature
use uni::perl;  # turns on many utf8 things
use Unicode::Normalize qw(NFD NFC);
use warnings;

while (<>) {
    chomp;
    my $data = NFD($_);                  # decompose the input line
    say "OK" if utf8::is_utf8($data);
}
At this point, from the UTF-8 encoded STDIN I get a correct Unicode string in $data; e.g. "\w" will match multibyte [\p{Alphabetic}\p{Decimal_Number}\p{Letter_Number}] characters (and maybe something more). That's OK and works.
AFAIK $data does not contain UTF-8, but a string in Perl's internal Unicode format.
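To make the distinction I mean concrete, here is a minimal sketch, assuming the core Encode module and a hand-written two-octet input (the variable names are only for illustration). It shows the same "é" once as raw UTF-8 octets and once as a decoded character string:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 octets for "é" (U+00E9), as they would arrive from a file.
my $bytes = "\xC3\xA9";

# Decoding turns the octets into a character string (one code point).
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d\n", length $chars;
printf "code point: U+%04X\n", ord $chars;

# Encoding goes back the other way, from characters to octets.
my $again = encode('UTF-8', $chars);
print "round trip ok\n" if $again eq $bytes;
```

As far as I understand, the while(<>) loop above gives me the second kind of string, because the input layer already decoded it.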
Now the questions:
- HOW can I ensure (test) that any $other_data contains a valid Unicode string?
- What is the purpose of utf8::is_utf8($data)? The whole utf8 pragma is a mystery to me.
I understand that use utf8; only tells Perl that my source code is in UTF-8 (so it does something similar to what happens when my script starts with a BOM, as for big-endian). From Perl's point of view, my source code is like an external file, and Perl should know what encoding it is in...
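A small sketch of what I believe the pragma does, assuming the script file itself is saved as UTF-8 (the variable names are hypothetical). The same literal is read once outside and once inside the lexical scope of use utf8:

```perl
use strict;
use warnings;

my ($raw, $decoded);

{
    # No 'use utf8' in scope: the literal is taken as the raw octets
    # found in the source file (two octets for "é" in a UTF-8 file).
    $raw = "é";
}

{
    use utf8;   # lexically scoped: source text in this block is decoded as UTF-8
    $decoded = "é";
}

printf "without use utf8: length %d\n", length $raw;
printf "with use utf8:    length %d\n", length $decoded;
```

If my understanding is right, the pragma affects only how literals and identifiers in the source are read, not any data read from STDIN or files.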
In the above example the utf8::is_utf8($data) test prints OK, but I don't understand WHY.
Internally Perl does not use UTF-8, so my UTF-8 data file is converted into Perl's internal Unicode format. So why does utf8::is_utf8($data) return true for $data, which is not in UTF-8 format? Or is it misnamed, and the function should be named uni::is_unicode($data)???
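Here is a small experiment that shows what puzzles me; a sketch using only the core utf8:: functions, with variable names made up for the example. The two strings compare equal and have the same length, yet the function answers differently for them, so it seems to report something about how the string is stored rather than about its content:

```perl
use strict;
use warnings;

# "é" as a single character with code point 0xE9.
my $plain = "\xE9";

# A copy that is asked to switch to the other internal representation.
my $upgraded = $plain;
utf8::upgrade($upgraded);

printf "plain:    flag=%s length=%d\n",
    (utf8::is_utf8($plain)    ? "on" : "off"), length $plain;
printf "upgraded: flag=%s length=%d\n",
    (utf8::is_utf8($upgraded) ? "on" : "off"), length $upgraded;

print "strings are equal\n" if $plain eq $upgraded;
```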
Thanks in advance for clarification.
PS: @brian d foy: yes, I still don't have the Effective Perl Programming book. I will get it, I promise :) /joking/