Perl, HTML data and characters encoded in UTF-8

Question

I'm a beginner in Perl.

I wrote a Perl script that parses data from an HTML site. My script encodes the data in UTF-8, but some of the data contains Romanian characters, and encoding them produces incorrect characters such as:

ţ = þ (incorrect); ş = º (incorrect); ă = ã (correct);

Example line to parse from the HTML:

Distribuţia: Robert Downey Jr. (Sherlock Holmes) Jude Law (Dr. John Watson) Rachel McAdams (Irene Adler) Mark Strong (Lord Blackwood) Kelly Reilly (Mary Morstan) Eddie Marsan (Inspectorul Lestrade) James Fox (Sir Thomas)

I want to split this with:

my ($credits, $line);
foreach $credits (split /(?=\w+:)\s*/, $line) {
...

But the output is wrong, because "þ" is treated as a non-word character and the line breaks there:

Distribuþ
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)

Output wanted (correct):

Distribuţia
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)

If I use the "\p{Alpha}" property instead of "\w", the problem is partially solved: the line breaks correctly, but the output shows "Distribuþia" rather than "Distribuţia" (this probably happens with other characters too). It looks like this (incorrect):

Distribuþia
Robert Downey Jr. (Sherlock Holmes)
Jude Law (Dr. John Watson)
Rachel McAdams (Irene Adler)
Mark Strong (Lord Blackwood)
Kelly Reilly (Mary Morstan)
Eddie Marsan (Inspectorul Lestrade)
James Fox (Sir Thomas)
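
The symptoms above (ţ shown as þ, ş as º, ă as ã) are what ISO-8859-2 bytes look like when they are read as Latin-1, so one possible cause is that the page bytes are never decoded before matching. The following is a minimal sketch under that assumption, not a confirmed diagnosis: the encoding name and the sample bytes are assumptions, and the variable names are made up for illustration. It decodes the bytes with the core Encode module, matches on characters, and encodes to UTF-8 only for output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the site serves ISO-8859-2 bytes; 0xFE is "ţ" in that encoding.
my $raw_bytes = "Distribu\xFEia: Robert Downey Jr. (Sherlock Holmes)";

# Turn the raw bytes into Perl characters before doing any matching.
my $text = decode('ISO-8859-2', $raw_bytes);

# On the decoded string, \w matches "ţ" (U+0163), so the whole label is captured.
my ($label) = $text =~ /^(\w+):/;

# Encode to UTF-8 only at the point of output.
my $utf8_label = encode('UTF-8', $label);
print "$utf8_label\n";
```

If the page actually uses a different encoding (for example Windows-1250), the same pattern applies with a different name passed to decode.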

ikegami · Accepted Answer · 2011-09-13T03:59:37 · 4 votes

Text::Unidecode

>perl -MText::Unidecode -E"say unidecode qq{rom\x{00E2}n\x{0103}}"
romana
