4
votes

I have an encoding issue in Perl when trying to pull global addresses from web pages using LWP::UserAgent together with Encode for character decoding. I've tried the solutions I could find by googling, but nothing seems to work. I'm using Strawberry Perl 5.12.3.

As an example, take the contact page of the US embassy in the Czech Republic (http://prague.usembassy.gov/contact.html). All I want is to extract the address:

Address: Tržiště 15 118 01 Praha 1 - Malá Strana Czech Republic

Firefox displays this correctly using the UTF-8 character encoding, which matches the charset in the web page's header. But when I use Perl to fetch the page and write the address to a file, the encoding looks wrong, even though I use decoded_content from LWP::UserAgent or Encode::decode.

I've tried running a regex over the data to check that the error isn't introduced when the data is printed (i.e. that the data is correct internally in Perl), but the error seems to be in how Perl handles the encoding.
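One way to run that kind of check without depending on how the terminal renders text is to print code points instead of characters. The following is only a minimal sketch: `code_points` is a hypothetical helper (not part of Encode or LWP), and the byte string is a hand-written sample rather than data fetched from the page.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render each character of a string as U+XXXX.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $str;
}

# UTF-8 bytes of U+017E (z with caron), written as escapes so that the
# encoding of this source file does not matter.
my $bytes = "\xC5\xBE";

# Decoding turns the two bytes into a single character.
my $chars = decode('UTF-8', $bytes);

print code_points($bytes), "\n";   # U+00C5 U+00BE (still two raw bytes)
print code_points($chars), "\n";   # U+017E        (one decoded character)
```

If the decoded string shows one code point per expected letter, the data inside Perl is fine and any remaining garbling comes from the output side.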

Here's my code:

#!/usr/bin/perl

require Encode;
require LWP::UserAgent;
use utf8;

my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->env_proxy;

my $output_file;
$output_file = "C:/Documents and Settings/ian/Desktop/utf8test.txt";
open (OUTPUTFILE, ">$output_file") or die("Could not open output file $output_file: $!" );
binmode OUTPUTFILE, ":utf8";
binmode STDOUT, ":utf8";

# US embassy in Czech Republic webpage
$url = "http://prague.usembassy.gov/contact.html";

$ua_response = $ua->get($url);
if (!$ua_response->is_success) { die "Couldn't get data from $url";}

print 'CONTENT TYPE: '.$ua_response->content_charset."\n";
print OUTPUTFILE 'CONTENT TYPE: '.$ua_response->content_charset."\n";

my $content_not_decoded;
my $content_ua_decoded;
my $content_Endode_decoded;
my $content_double_decoded;

$ua_response->content =~ /<p><b>Address(.*?)<\/p>/;
$content_not_decoded = $1;
$ua_response->decoded_content =~ /<p><b>Address(.*?)<\/p>/;
$content_ua_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_Endode_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_double_decoded = $1;

# get the content without decoding
print 'UNDECODED CONTENT:'.$content_not_decoded."\n";
print OUTPUTFILE 'UNDECODED CONTENT:'.$content_not_decoded."\n";

# print the decoded content
print 'DECODED CONTENT:'.$content_ua_decoded."\n";
print OUTPUTFILE 'DECODED CONTENT:'.$content_ua_decoded."\n";

# use Encode to decode the content
print 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";
print OUTPUTFILE 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";

# try both!
print 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";
print OUTPUTFILE 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";

# check for an ampersand, i.e. a leftover HTML entity such as &#NNN; (to guard against the error coming from the print statement)
if ($content_not_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_ua_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
    print OUTPUTFILE "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
}
if ($content_Endode_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_double_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
}

close (OUTPUTFILE);
exit;

And here's the output to terminal:

CONTENT TYPE: UTF-8
UNDECODED CONTENT::
Tr├à┬╛išt├ä┬¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DOUBLE-DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR

And here's the output to the file (note that this differs slightly from the terminal output but is still not correct). OK, wow: this shows up correctly on Stack Overflow, but not in Bluefish, LibreOffice, Excel, Word or anything else on my computer. So the data is there, just incorrectly encoded. I really don't get what's going on.

CONTENT TYPE: UTF-8
UNDECODED CONTENT::
TržištÄ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
Czech Republic
DOUBLE-DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
Czech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR

Any pointers on how to make this work would be really appreciated.

Thanks, Ian/Montecristo

2 Answers

5
votes

The mistake is using a regex to parse HTML. At the very least, you are not decoding the HTML entities. You can do that manually, or leave it to a robust parser:

use strictures;
use Web::Query 'wq';
use autodie qw(:all);

open my $output, '>:encoding(UTF-8)', '/tmp/embassy-prague.txt';
print {$output} wq('http://prague.usembassy.gov/contact.html')->find('p')->first->html; # or perhaps ->text
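
For the manual route, a minimal sketch using `decode_entities` from HTML::Entities might look like the following. The fragment is a made-up sample; which characters the real page writes as entities is an assumption here.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Encode characters as UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';

# Assumed sample fragment; the real page's markup may differ.
my $fragment = 'Tr&#382;i&scaron;t&#283; 15, Mal&aacute; Strana';

# In non-void context decode_entities returns the decoded copy
# and leaves $fragment unchanged.
my $text = decode_entities($fragment);

print "$text\n";   # Tržiště 15, Malá Strana
```

Both numeric references (`&#382;`) and named ones (`&scaron;`) become ordinary Perl characters, so the ampersand check from the question no longer fires on the result.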
2
votes
#!/usr/bin/env perl

use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open     qw(:std :utf8);

use LWP::Simple;
use HTML::Entities;

my $content = get 'http://prague.usembassy.gov/contact.html';

my ($address) = ($content =~  m{<p><b>Address(.*?)</p>});
decode_entities($address);

say $address;

From the command line:

C:\temp> uu > tt.txt

C:\temp> gvim tt.txt

and the following text is displayed in GVim (which is in UTF-8 mode):

</b>:<br />Tržiště 15<br />118 01 Praha 1 - Malá Strana<br />Czech Republic

See also Tom Christiansen's standard preamble.
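
As an aside, the stray "Ä" in the UNDECODED CONTENT line of the question's file output looks consistent with double encoding: undecoded bytes printed through a `:utf8` layer get encoded a second time. The sketch below only illustrates that mechanism with a hand-written byte string, and `encode` stands in for the output layer, so treat it as an assumption about the cause rather than a confirmed diagnosis.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of U+011B (e with caron), as they would arrive in an
# undecoded response body.
my $bytes = "\xC4\x9B";

# Decoding turns the two bytes into one character.
my $chars = decode('UTF-8', $bytes);

# A UTF-8 output layer encodes whatever string it is handed;
# encode() stands in for that layer here.
my $from_chars = encode('UTF-8', $chars);   # one character back to two bytes
my $from_bytes = encode('UTF-8', $bytes);   # each byte treated as a Latin-1 character

print unpack('H*', $from_chars), "\n";   # c49b
print unpack('H*', $from_bytes), "\n";   # c384c29b
```

The second result starts with `c3 84`, which a UTF-8 viewer shows as "Ä". Decoding once on input and encoding once on output avoids this.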