2
votes

Beginner in Perl.

I wrote a Perl script that parses data from an HTML site. The script encodes the data in UTF-8, but one of the fields contains Romanian characters, and encoding it produces incorrect characters such as:

ţ = þ (incorrect); ş = º (incorrect); ă = ã (correct);

Example line to parse from the HTML:

Distribuţia: Robert Downey Jr. (Sherlock Holmes) Jude Law (Dr. John Watson) Rachel McAdams (Irene Adler) Mark Strong (Lord Blackwood) Kelly Reilly (Mary Morstan) Eddie Marsan (Inspectorul Lestrade) James Fox (Sir Thomas)

I want to split this with:

my ($credits, $line);
foreach $credits (split /(?=\w+:)\s*/, $line) {
...

but because "þ" is treated as a non-word character, the string is split incorrectly at that point, and the output is:

Distribuþ
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)

Output wanted (correct):

Distribuţia
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)

If I use the "\p{Alpha}" property instead of "\w", the problem is only partially solved: the line is split correctly, but "Distribuþia" is displayed rather than "Distribuţia" (and the same probably happens with other characters). The output looks like this (incorrect):

Distribuþia
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)

4 Why are you using ASCII? - SLaks

4 Answers

4
votes

Text::Unidecode

>perl -MText::Unidecode -E"say unidecode qq{rom\x{00E2}n\x{0103}}"
romana

3
votes

Just keep everything in utf-8.

If you want the Romanian 8-bit characters to display correctly on your machine, you will need to set your default environment to use the Romanian code page and make sure you have the correct fonts etc. to display them.

Much easier to leave everything as utf-8 and let the magic happen.
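
A minimal sketch of what "keep everything in UTF-8" could look like in practice, assuming the input really is UTF-8. The sample string and the variable names are made up for illustration; the idea is to decode once when reading and encode once when writing, using I/O layers:

```perl
use strict;
use warnings;

# Hypothetical UTF-8 input, held in memory for the demo; "\xC5\xA3" is the
# UTF-8 byte sequence for "ţ" (U+0163).
my $utf8_bytes = "Distribu\xC5\xA3ia: Robert Downey Jr.\n";

# Decode while reading: the layer turns UTF-8 bytes into characters.
open(my $in, '<:encoding(UTF-8)', \$utf8_bytes) or die "Cannot open: $!";
my $line = <$in>;
close($in);

# Work on characters: \w matches "ţ" on the decoded string.
my ($label) = $line =~ /^(\w+):/;

# Encode while writing: the layer turns characters back into UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)');
print "$label\n";
```

With a real file you would pass the file name instead of the in-memory scalar reference; the layers stay the same.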

2
votes

þ is the Latin-1 character that has the same byte value as the Latin-10 character ț. It looks like you're not specifying the right character encoding when you read in the string. Presumably the web-page you're parsing is using Latin-10 but you're reading it into Perl without specifying any I/O encoding.

If this is the case, you should tell Perl the character encoding when opening the file:

open(my $fh, '<:encoding(ISO-8859-16)', $file) or die "Cannot open $file: $!";

or if you don't have control over the file open and want to fix the string, you can convert it using:

use Encode;
$str = Encode::decode('ISO-8859-16', $str);  # decode() returns the decoded string

Both approaches will convert the data into Perl's internal Unicode-aware string format, instead of Latin-1 bytes.

Note that you may need to also fix your output to encode the data as UTF-8 or Latin-10 depending on your needs.
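
To illustrate the difference decoding makes for the regex in the question, here is a small sketch. It assumes the page really is ISO-8859-16; the sample line (including the "Regia" field) is made up, and the pattern is a simplified label match rather than the asker's exact split:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive from the page. 0xFE is "ț" in ISO-8859-16
# and "þ" when the same byte is read as Latin-1. Sample data is hypothetical.
my $bytes = "Distribu\xFEia: Robert Downey Jr. Regia: Guy Ritchie";

# On undecoded bytes, \w does not match 0xFE, so the first label is cut short.
my @raw_labels = $bytes =~ /(\w+):/g;    # ("ia", "Regia")

# Decode into Perl's internal character string first.
my $line = decode('ISO-8859-16', $bytes);

# Now "ț" is a word character and the whole label is captured.
my @labels = $line =~ /(\w+):/g;

# Encode explicitly on the way out (UTF-8 here).
print encode('UTF-8', join("\n", @labels) . "\n");
```

If the source turns out to use a different 8-bit encoding, only the encoding name passed to decode() should need to change.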

0
votes

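A simple y/// (tr///) might do it, if your data and your source file are both in UTF-8. Note that the data has to be decoded to characters for the mapping to work:

use utf8;  # the accented literals below are UTF-8 in the source file
my $data = yadayada;  # placeholder for your decoded input
$data =~ tr/áéíóúçãõñ/aeioucaon/;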

Show us some actual code :)
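
A sketch of the tr approach adapted to Romanian letters, assuming the input is UTF-8 and the script itself is saved as UTF-8. The character list is illustrative, not exhaustive, and the sample bytes are made up:

```perl
use strict;
use warnings;
use utf8;                 # the literal accented characters below are UTF-8 in this file
use Encode qw(decode);

# Hypothetical input: "română" as raw UTF-8 bytes, as read from a file or socket.
my $bytes = "rom\xC3\xA2n\xC4\x83";

# Decode first, so that tr works on characters rather than on individual bytes.
my $data = decode('UTF-8', $bytes);

# Map each Romanian letter (cedilla and comma-below variants) to a base letter.
(my $ascii = $data) =~ tr/ăâîşţșț/aaistst/;

print "$ascii\n";
```

This throws information away, so it only makes sense if plain ASCII output is what you actually want.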